The PRECC Compiler Compiler

Peter T. Breuer & Jonathan P. Bowen
Oxford University Computing Laboratory, 11 Keble Road, Oxford OX1 3QD, UK.
{Peter.Breuer,Jonathan.Bowen}@comlab.ox.ac.uk

Abstract

PRECC is a UNIX utility that has been designed to extend the capabilities of the familiar LEX and YACC front-end design and implementation tools. The utility is a compiler-compiler that takes unlimited lookahead and backtracking, the extended BNF notation, and parametrized grammars with (higher order) meta-parameters to the world of C programming. The generated code is standard ANSI C and is `plug compatible' with LEX-generated lexical analysers prepared for YACC. In contrast to YACC, however, the generated code is modular and thus allows parts of scripts to be compiled separately and linked in incrementally. But it remains efficient in practice, and the generated code has run as fast or faster than YACC-generated code in trials with real programming language specifications.

1 Introduction

(Funded by the UK Information Engineering Directorate SAFEMOS project, IED3/1/1036.)

YACC [7] is well established as a general-purpose compiler-compiler, almost invariably still the tool of choice in the development under UNIX systems of off-line applications with complex input formats, but it is beginning to show its age. Issues which were important when hardware resources were scarce are now generally less critical. To name but one possibility, it may have become justifiable to trade off tight bounds on run-time memory against maintainability or expressiveness in the specification language itself. Yet LALR(1) automaton-based parser generators like YACC have continued to dominate the compiler-compiler scene, despite (or because of) their venerable pedigree. Certainly, a finite state


automaton does have a precisely delimited run-time loading, which can be comforting, but there are some more negative consequences of the technology.

Firstly, and notably, both LEX- and YACC-generated C code is monolithic, and a large language description, like that for COBOL [16], with over two hundred keywords, generates such a large virtual automaton that swapping problems may occur at runtime even on modern workstations. This can result in a poor perceived performance for the application to which the parser acts as front end, no matter how good the application code itself may be.

Secondly, the C compiler itself sometimes has problems compiling the generated code, containing as it does a very large case statement which must be optimized to an efficient form by the compiler, and this makes for slow compilation and a long turn-round time when it comes to debugging or altering a specification. The monolithic code format forces everything to be recompiled when just a single change is needed, and this can be a source of frustration for the software engineer seeking to use a specification-driven utility in order to cut development or maintenance time.

Thirdly, numerous `shift/reduce' clashes may be expected in a YACC script, and YACC will issue warnings for these at compile time. This phase of YACC's activity constitutes a static check for implementability, and it implies that the specification language cannot have the independent semantics expected of BNF, no matter how much like BNF it may look. Moreover, the detailed clash reports refer to the generated automaton, and can be extremely difficult to relate to what might be wrong with the script. Further, it is frequently the case that some conflicts have to be accepted in the final script, and since these are resolved in favour of shifts by default, this means that extra (and unspecified) grammar productions are introduced.
That is, a YACC script with residual shift/reduce conflicts will always generate a parser that is looser than the intended specification. It can, all in all, be very frustrating for engineers hoping to use the compiler-compiler for rapid prototyping, because time has to be spent debugging the generated automaton as well as the script itself.

Nevertheless, on the face of it, YACC offers some attractive programming options. It accepts a semantically complete variety of the BNF grammar specification language, and compiles the high-level definition scripts into a C program [9], which is a very portable approach. In practice, however, the drawbacks may be more apparent. The supported BNF is an impoverished syntactic subset, for example, every BNF construct having to be expanded out by hand into the basic `sequential' and `alternation' components before it can be incorporated into a YACC script, which obscures the specification and makes maintenance difficult. This is true too of

    #include "ccx.h"
    #define INIT V(1) = 0;
    #define STEP V(1) = V(0) + 1;
    #define TOTL printf("%d terms OK\n Next terms are %d, %d, ..\n", V(0), a, b);

    @ fibs = { : INIT : fib(1,1) $! }
    @ fib(a,b) = number(a) ⟨','⟩ : STEP : fib(b,a+b)
    @          | ⟨'.'⟩ ⟨'.'⟩ : TOTL :
    @ number(n) = digit(n)
    @           | digit(HEAD(n)) number(TAIL(n))
    @ digit(n) = ⟨n+'0'⟩

    MAIN(fibs)

Figure 1(a): A PRECC parser which accepts only the Fibonacci sequence as input and pinpoints errors to within a character. (HEAD and TAIL are C macros.)

    1, 1, 2, 3, 5, ..
    5 terms OK  Next terms are 8, 13, ..

    1, 1, 2, 3, 5, 8, 13, 21, 34, 51, 85, ..
    (line 2 error) failed parse: probable error at ⟨⟩1, 85, ..

Figure 1(b): Two input lines and responses.


more modern implementations of YACC, and also of BISON (a similar, widely distributed utility available from the Free Software Foundation), which use improved internal algorithms (see [17], for example) but possess essentially the same scripting language and functionality.

It is also the case that BNF on its own is simply not expressive enough to capture the syntax of many programming languages. For example, the extended BNF-style specification in Figure 1 specifies a parser which only accepts an initial segment of the Fibonacci sequence (1, 1, 2, 3, 5, 8, 13, ...) as input. This would be impossible to specify in non-extended BNF, and therefore very hard to program in YACC, where the closest approximation would be a list of numbers:

    b : '.' '.'
      | NUMBER ',' b
      ;

Extra syntax constraints to make the series conform to a Fibonacci series can be programmed into the YACC script using side-effecting actions embedded in attached C code, but the result is inevitably a specification that has lost its comprehensibility and its maintainability.

One approach to improving on YACC to parse the non-LR(k) grammars (which require infinite lookahead) would be to modify YACC itself. This is the approach taken in [13] in order to handle C++ [19] adequately, by allowing dynamic stack allocation, but it is not very elegant. Even the original author of YACC has produced a prototype tool, Y++, to handle C++, using the notion of attribute grammars [8]. Another approach, as described in this paper, is to produce a considerably more versatile tool from scratch, and design it to accept higher-level grammar descriptions. More modern languages such as C++, which require more complicated parsing than older languages, may be the key to the widespread acceptance of improved parser technology, although both issues drive each other.
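The constraint that Figure 1 captures with inherited attributes can be imitated directly in hand-written C. The sketch below is not PRECC output; it is a hypothetical recursive-descent recognizer for the same language, with the inherited attributes a and b passed as ordinary function parameters:

```c
#include <stdio.h>
#include <string.h>

/* Hand-written sketch (not PRECC output) of the Figure 1 grammar
       fib(a,b) = number(a) ',' fib(b,a+b)  |  ".."
   The inherited attributes a and b thread the next two expected
   terms through the descent, which plain BNF cannot express. */

static const char *skip_ws(const char *s) {
    while (*s == ' ') s++;
    return s;
}

/* Match the decimal literal for the expected value n; return the
   remaining input on success, NULL on failure. */
static const char *number(const char *s, int n) {
    char buf[16];
    sprintf(buf, "%d", n);
    return strncmp(s, buf, strlen(buf)) == 0 ? s + strlen(buf) : NULL;
}

static const char *fib(const char *s, int a, int b) {
    s = skip_ws(s);
    if (strncmp(s, "..", 2) == 0) return s + 2;   /* terminator */
    if ((s = number(s, a)) == NULL) return NULL;  /* expected term */
    s = skip_ws(s);
    if (*s != ',') return NULL;
    return fib(s + 1, b, a + b);                  /* attributes move on */
}

/* 1 iff the input is an initial segment of the Fibonacci sequence. */
int accepts_fib(const char *s) {
    const char *rest = fib(s, 1, 1);
    return rest != NULL && *skip_ws(rest) == '\0';
}
```

In YACC the attribute pair would have to be maintained by side-effecting action code; here, as in the PRECC script, it is simply part of the grammar's parameters.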
The authors' particular need for the utility described in this paper, PRECC, has come through the requirement to provide tools for the inter-conversion of concrete programming and specification languages. Existing utilities were simply neither powerful nor convenient enough to do the job. The PRECC specification in Figure 1 illustrates the point. PRECC allows the designer to write scripts in extended BNF notation, using parametrized grammars if necessary. The designer may make use of attributes dynamically attached to or picked up from the parsed phrases in the succeeding parts of the parse. This makes the specification language much more powerful, and also enables the user to write `macros' for commonly-used constructions. For example, one may expect to express a comma-separated list of foo's as a YACC script:

    phooey : foo
           | foo ',' phooey
           ;

having replaced foo by the particular non-terminal required. This construction has to be repeated everywhere a comma-separated list is required within the script: where lists of variables are required, lists of expressions, lists of array dimensions, and so on. In PRECC, the following `macro' can be defined:

    @ comma_list(x) = x ⟨','⟩ comma_list(x)
    @               | x

(the longest match must come first in the list of alternates in PRECC) and then used with the appropriate parameters in the appropriate position in the script:

    @ phooey = comma_list(foo)
    @ expr_list = comma_list(expr)
    @ const_list = comma_list(const)
    ...

Moreover, using YACC, the intended definitions of expr_list and const_list:

    expr_list : expr
              | expr ',' expr_list
              ;
    const_list : const
               | const ',' const_list
               ;

will provoke clashes, both because of the one-token-lookahead semantics (1TLA) and because of the LR (bottom-up) implementation of the automaton. One problem is that constants are a subclass of expressions, so YACC will not inherently be able to know from the first token in a list which production it is processing, expr_list or const_list. Nor, if lists of expressions and constants appear in several contexts, will it know which production to jump to (`reduce to'). Nor, if some other productions call for only a single expression, and not a list (as in the right-hand side of a simple assignment, for example), will YACC know whether, after the first token, to continue looking for more tokens in order to build a list of expressions (`shift'), or to jump immediately into the production with what it has (`reduce'). Ambiguities like this will cause YACC to generate many warnings at the time that it generates code, but it is extremely difficult to avoid writing scripts which provoke such warnings, because the scripts which are ambiguous from the point of view of

YACC's bottom-up semantics are the most natural. On the other hand, PRECC disambiguates productions both by context and by infinite lookahead, so it can be expected to cope correctly with scripts which are written in a natural and fairly free top-down style. PRECC's semantics is declarative, and every expression and production in a PRECC script has a meaning that is independent of whatever else may be in the script.

Furthermore, under PRECC, each production (for example, each of those above) can be placed in a different script. Each script can then be compiled to a separate module, and the modules can then be linked together to form the complete front end. So it would be possible to change the definition of phooey, for example, and regenerate the front-end object code without having to regenerate the monolithic script, just the section relating to phooey. This cuts maintenance turn-round time and makes servicing much easier, because modules can be replaced individually without the whole front end having to be taken out of service.

In the rest of this paper, we will set out the programming language and facilities that PRECC offers. First, however, we give the results of some actual tests against YACC.
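What a parametrized production such as comma_list compiles into can be sketched in C, with the grammar parameter passed as a function pointer. This is an illustrative model only, not PRECC's actual generated code:

```c
#include <stddef.h>

/* Illustrative model (not PRECC's actual output): a parser consumes a
   prefix of the input and returns the remainder, or NULL on failure. */
typedef const char *(*parser_fn)(const char *);

/* comma_list(x) = x ',' comma_list(x) | x
   The longer alternative is tried first, falling back on failure. */
const char *comma_list(parser_fn x, const char *s) {
    const char *after = x(s);
    if (after == NULL) return NULL;
    if (*after == ',') {
        const char *rest = comma_list(x, after + 1);
        if (rest != NULL) return rest;  /* longest match wins */
    }
    return after;                       /* fall back to a bare x */
}

/* Example element parser: one lower-case letter. */
const char *ident(const char *s) {
    return (*s >= 'a' && *s <= 'z') ? s + 1 : NULL;
}
```

Here comma_list(ident, "a,b,c") consumes the whole input. PRECC's parametrized productions go further than this sketch, since a parameter may itself be a parametrized production (a higher-order `macro'), which a plain C function pointer only approximates.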

2 Trials

Comparative tests were undertaken between PRECC and GNU BISON (on a 16MHz 386sx PC under MSDOS 5.0) and YACC (on an HP 9000s300 workstation running HP-UX), and the results are shown in Table 1. A parser for the Oberon-2 language was compiled by all the compiler-compilers, using as a basis a YACC specification(1) following the definition given in [14]. Although many other languages have been treated by PRECC, they have for the most part been without YACC specifications, and Oberon-2 is the first complete language for which an existing (and working) YACC specification has been used as a basis for a PRECC specification. As few changes as possible were made (the resulting specification is better tuned to YACC than to PRECC) and all parsers used the same 70-token lexer, compiled by the `fast lex' utility, FLEX.

(1) Specification supplied by T. Channon ([email protected]). YACC specification originally by Stephen J. Bevan ([email protected]) and copyright (c) 1991 Department of Computer Science, University of Manchester.

For example, the following fragment of the PRECC script:

    @ Designator_Base = Qualified_Ident
    @                   [ Designator_Record_Ref
    @                   | Designator_Array_Ref
    @                   | Designator_Indirection
    @                   | Designator_Guard
    @                   ]


corresponds to the following original in the YACC script:

    Designator_Base : Qualified_Ident
                    | Qualified_Ident Designator_Record_Ref
                    | Qualified_Ident Designator_Array_Ref
                    | Qualified_Ident Designator_Indirection
                    | Qualified_Ident Designator_Guard
                    ;

Both YACC and BISON were able to read and compile the original YACC script, which therefore appears to be fairly standard. All the compiled parsers differ only slightly in behaviour. On a suite of 33 standard Oberon test scripts supplied with the original YACC specification, all are accepted by the BISON and YACC parsers, and 30 are accepted and 3 rejected by the PRECC parser. One of the rejected scripts (syn7.ob) contains the error:

    CONST x, y = 3;

Simultaneous constant declarations are not allowed in Oberon. In the second script (syn26.ob) rejected by the PRECC parser, the constructions

    a(a).c(x);
    a(a).c[x,y];

are the cause. `a(a).c' ought to be accepted (a record reference off a function result), and so also ought `a.c(x)' (a qualified function call) and `a.c[x,y]' (a qualified array reference), but mixtures ought to have been rejected as nonsense. The third script (syn30.ob) contains the error

    x := {10 .. 15 .. 19, 20, 30};

in which the doubled range `10 .. 15 .. 19' is an invalid set element.
The reason for the acceptances by the YACC and BISON parsers is the occurrence of 5 shift/reduce conflicts within their scripts. The CONST acceptance is caused by a shift into the list state when a list is not specified by the script; and the bogus qualifications, it seems, are caused by shifts where the script wants `reduce', with the result that an invalid construction is wrongly accepted. The translated PRECC script does not pick up this artifact of the YACC-style implementations, and hence (correctly) rejects the unspecified constructions. The set element error is also the kind of problem that might be expected from a shift instead of a reduce. There are 358 states in the YACC/BISON compilation and 332 intermediate functions in the PRECC compilation (a measure of the additional complexity introduced by PRECC over and above that in the script itself).

    Environment          16MHz 386sx MSDOS 5.0      HP 9000s300 HP-UX
    Utility              PRECC       BISON          PRECC       YACC
    Source               oberonp2.y  oberon_2.y     oberonp2.y  oberon_2.y
    Size (bytes)         21599       19477          20534       18223
    Output               oberonp2.c  y_tab.c        oberonp2.c  y.tab.c
    Size (bytes)         63829       44615          61170       50307
    Compile time (secs)  21.53       27.02          7           11
    Shift/reduces        n/a         5              n/a         5
    Run time (secs)      12.03       9.51           15          9
    Errors detected      3           0              3           0

Table 1: Comparative figures for compilation and execution (on a 33-script test suite) of Oberon-2 by PRECC, GNU BISON and UNIX YACC. (Timings include system and execution time.)

As a test of the compiled executables, the 33 test scripts were passed through the parsers in sequence. There are a total of 394 lines in the scripts. BISON/YACC code is noticeably quicker to run, but the specification was written for BISON, not PRECC, and the translated script for PRECC contains forced backtracks (i.e., some poorly crafted productions). In particular, the three error scripts where errors are detected will have slowed the PRECC parser down through the extra backtracking searches for alternate parses, as well as through generating more output error messages.

In conclusion, in the tests, PRECC appeared to be about 30% faster at compiling, and the compiled parsers about 30% slower, than the YACC or BISON equivalents; but the factors discussed above tended to bias the results in favour of the YACC implementations, which in any case failed to detect several deliberate errors in the test scripts.

3 The declarative approach to parsing

Developments in declarative languages, both logic languages (e.g., Prolog) and functional languages (e.g., ML), have led to the incorporation of declarative parser (that is, compiler) generators into the associated development environments, and these allow more expressive use of BNF-style descriptions than YACC, and thus allow more languages to be expressed precisely. When these declarative environments are compilable, the results from their built-in compiler-compiler facilities are very acceptable relative to YACC's non-functional overheads, and much more acceptable from the point of view of the convenience, succinctness and accuracy of the definitional language itself. Indeed, the op(...) pseudo-clause of Prolog allows enough flexibility in the definition of pre/in/post-fix operators, together with their associativity and precedence, for an abstract tree input to a compiler to be sufficiently readable that it is usable directly, at least for rapid-prototyping purposes [3].

Unfortunately, higher-order language suites usually compile into native machine code, with a consequent lack of portability between different architectures, and usually carry the considerable overhead of at least some part of the development environment with them into the final product. Even if the overhead is not considerable in terms of machine loading, it still may be considerable in financial terms; and there is always some nagging doubt about the effectiveness of the essentially recursive code, no matter how good the optimization is, because of the number of layers of translation involved. PRECC has been designed to implement just enough of a (compiled) functional language to be able to handle parsing, and no more.

The `secret' (though hardly a secret to many functional programmers) behind all declarative compilers and parsers is top-down recursive descent parsing [6] with infinite look-ahead. Heavily recursive in execution, the parser definitions can nevertheless be made as elegant as one cares to make them. In a functional language, all that has to be done is to implement the higher-order combinators that make compound parsers out of simpler component parsers. The combinators correspond to the operators and separators in BNF notation, so the gap between the specification language (BNF) and the implementation as a working parser can be tiny, especially if the functional language allows the direct binding of the appropriate mixfix symbols to these higher-order functions. This method is set out with memorable clarity by Jon Fairbairn in [6], but runtime considerations do not figure heavily in the account.
The functional method turns out to transfer quite easily and naturally to C, and the generated C code follows the form of the intended grammar closely, providing a natural modularity. Indeed, the functional model, as implemented by PRECC, offers a way round the morass of implementation compromises, as exemplified by YACC's `shift/reduce' errors: all BNF constructs are implemented correctly in the functional model and therefore they are implemented correctly in PRECC, so there can be no implementation conflicts to report. PRECC validly refines the recursion equations which hold in the functional programming model (there is some extra strictness introduced, but not in the higher-order combinators themselves).

PRECC also allows annotations to be placed in the specification script which enable better code to be generated, and thus circumvent the need for a garbage collector or over-conservative use of the stack. The cut marks `!' may be used to indicate points through which PRECC should not backtrack (presumably because the specification designer knows that no alternative parses exist for words of the language with the initial segment seen up to this point), and therefore points at which the stack can be jettisoned safely. This cuts space requirements, and also time requirements in the case of error, because no alternatives need be sought. For example, the following script specifies a language consisting purely of a's and b's:

    @ abs = ⟨'a'⟩ abs
    @     | ⟨'b'⟩ abs
    @     | $

but it can be annotated with cut marks as follows:

    @ abs = ⟨'a'⟩ ! abs
    @     | ⟨'b'⟩ ! abs
    @     | $!

and then the generated code ought to use only a constant space overhead instead of an amount linearly dependent on the input word length.

In one practical matter too, PRECC seems more amenable than YACC. It is far easier to debug, simply because the C code produced both follows the structure of the BNF source closely and is conserved through re-compilations. Local changes in the source force only local changes in the generated C code (another aspect of modularity), so breakpoints and the like can be relocated quickly.
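The correspondence between script and code can be seen by hand-coding the abs grammar from this section in the recursive-descent style. This is a simplified sketch under invented names: PRECC's real generated code is continuation-based, not the remainder-of-input model used here:

```c
#include <stddef.h>

/* Each parser takes the input position and returns the unconsumed
   remainder on success, or NULL on failure.  Trying the next
   alternative from the same saved position gives backtracking for
   free.  (A sketch only; PRECC's actual output is continuation-based
   and these names are invented.) */

static const char *lit(char c, const char *s) {
    return *s == c ? s + 1 : NULL;
}

/* abs = <'a'> abs | <'b'> abs | $  -- the script above, by hand. */
static const char *abs_p(const char *s) {
    const char *r;
    if (s == NULL) return NULL;                       /* earlier failure */
    if ((r = abs_p(lit('a', s))) != NULL) return r;   /* 1st alternate */
    if ((r = abs_p(lit('b', s))) != NULL) return r;   /* 2nd alternate */
    return *s == '\0' ? s : NULL;                     /* $ : end of input */
}

int accepts_abs(const char *s) { return abs_p(s) != NULL; }
```

A cut after ⟨'a'⟩ would correspond here to returning immediately once 'a' has been consumed, without ever falling through to the 'b' and $ alternatives, so the saved position need not be retained.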

4 Performance

Modularity and higher-order definitions were intended to compensate for deficiencies in execution speed with respect to YACC. After all, most of the effort in writing languages does go into maintenance of one kind or another, so it is reasonable to trade off execution speed against maintenance time, especially as speed is a function of hardware nowadays. But, surprisingly, PRECC-generated executables turned out to run at least as fast as those generated by YACC. There is no disadvantage discernible for the parsers that have been constructed (in fact, a considerable advantage is seen). It may be that LL parsers have got a bad press from old technology, and that nowadays their capabilities ought to be re-evaluated.

With hindsight, this can be attributed to the fact that C compilers implement subroutine calls, and therefore recursions, very efficiently nowadays. Presumably this was not altogether the case at the time that LEX and YACC were becoming popular, which is curious, because UNIX environments ought to provide an environment in which large call stacks are invisible both to the user and the programmer, being handled by the operating system through swapping and virtual memory if necessary. The space overhead claimed by PRECC is not even very great: 32KBytes of call stack is required to process 40-deep nested constructions in Occam 2 [4] (and 40-deep is the maximum in Occam, because there are eighty characters to a line and each nesting must be indented by at least two characters from the last), corresponding to about 2,500 stacked function calls from the kernel.

PRECC executables also turn out to be smaller than YACC-generated ones, with no sign of swapping problems. Again, with hindsight, this can be attributed to the higher-order and modular nature of PRECC output code, which ensures that fewer subroutines are necessary, and that subroutines have small bodies which never approach a memory page or even a memory cache in size.

5 Complexity and expressiveness

PRECC itself is known to be of linear complexity with respect to input definition scripts. It runs at thousands of lines of input per second on standard platforms such as Sun 3 or 4 and HP 9000 workstations, and the output code will compile under any ANSI-compliant C compiler. In theory too, its generated parsers (for non-left-recursive specifications) induce an overhead in space that is at most linear in the number of tokens received (between the marked `cut' points in the specification script).

Theoretically too, it can express any language L for which there is a decision algorithm inL for membership:

    inL(w)   iff  w ∈ L
    !inL(w)  iff  w ∉ L

because the parser L specified by:

    @ L = T(emptybuffer)
    @ T(w) = )inL(w)( $
    @      | ⟨'a'⟩ T(strcat(w,"a"))
      ...
    @      | ⟨'z'⟩ T(strcat(w,"z"))

can be proved to accept precisely the words w⟨EOL⟩ for which inL(w) is true, and reject the words w⟨EOL⟩ for which inL(w) is false, and to do so using at most |A||w| extra units of memory space over that required for inL(w), where the alphabet A consists of the range of tokens A = {'a', ..., 'z'}.

This is a somewhat clumsily phrased specification, and there are ways to make it cleaner. The built-in incoming token buffer yybuffer can be used as the reference to the word w, which saves using a parametrized grammar and allows the explicit token matches to be skipped:

    @ L = T
    @ T = )inL(yybuffer)( $
    @   | ? T

and this specification takes at most 2|w| units of space to accept or reject a word w⟨EOL⟩. If a non-side-effecting specification is required, then (version 3.40 and above of) PRECC can pick up the value of the token from the general token match `?', using the notation `?\x':

    @ L = T(emptybuffer)
    @ T(w) = )inL(w)( $
    @      | ?\x T(strncat(w,&x,1))

and the result is an entirely declarative specification, but with the same space complexity. The time complexity of the shorter specifications is better than that of the longer version because, with the latter, on average more alternative productions will have to be tried before a word w is parsed successfully. Nevertheless, the specifications above do demonstrate that PRECC is capable of parsing any language that may be defined algorithmically. In this respect, PRECC fulfils one of its design goals.

But quite apart from questions of expressive power, its time and space complexity is still low enough that it has been successfully used to make full-scale parsers and scanners for COBOL 74 [16], and to implement the programming language Uniform [18]. The definition scripts for these languages comprise between one and two thousand lines, involving between one and two hundred parser definitions. The number of keywords alone in COBOL is over one hundred; and the resulting executables are efficient in time and space.
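The construction above can be pictured in plain C. The sketch below is not PRECC output; the decision procedure inL is arbitrary and, purely for illustration, is taken here to accept words with an even number of 'a's. Tokens are buffered one per input character, and the word is accepted at the end-of-line marker exactly when inL holds:

```c
#include <stddef.h>

/* Illustrative decision procedure: an even number of 'a's. */
static int inL(const char *w) {
    int n = 0;
    while (*w) n += (*w++ == 'a');
    return n % 2 == 0;
}

/* Buffer one token per input character and accept at end-of-line
   exactly when inL holds of the buffered word, mirroring the
   @ L = T / T = )inL(yybuffer)( $ | ?T specification above. */
int accepts(const char *input) {
    char w[256];
    size_t len = 0;
    while (*input != '\0' && *input != '\n' && len + 1 < sizeof w)
        w[len++] = *input++;
    w[len] = '\0';
    return *input == '\n' && inL(w);
}
```

The extra space used is one buffer cell per token received, in line with the linear bound claimed for the real specification.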

6 Literacy and compatibility

PRECC permits the embedding of C code and C preprocessor macros or instructions anywhere in a PRECC script, with the resulting mix of `advantages' from PRECC's own referentially transparent higher-order functions and the referentially opaque C ones. In general, PRECC scripts are literate programs in the sense of Knuth [10], in that only certain parts of the script are visible to the utility itself (those lines beginning with an `@'), and these can therefore be embedded in ordinary text (i.e., literate prose) or other computer language instructions, typically C function definitions and preprocessor directives.

PRECC also provides a simple hook for tokenizing pre-filters such as the UNIX LEX [7] utility, to which it presents an interface that is intended to be `plug compatible' with YACC's. Most lexers aimed at YACC should be usable as they stand with PRECC, because PRECC calls a lexer called yylex() each time it wants a new token, just as YACC does. This mechanism can be circumvented by allowing the lexer to write directly into PRECC's buffer, yybuffer: while there are tokens in it, PRECC will read ahead from this area instead of calling yylex() again. Most YACC scripts can be easily converted to PRECC scripts via simple lexical changes and inversion of the order of alternatives within a production. There are some difficulties with YACC's `%prec' precedence operator, but operator precedences can be coded explicitly in difficult cases.
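For concreteness, here is a minimal hand-written lexer with the yylex() calling convention just described, which either YACC or PRECC could drive. The token codes and the set_input() helper are illustrative inventions, not part of PRECC:

```c
#include <ctype.h>

/* A minimal hand-written lexer in the yylex() style that both YACC
   and PRECC expect: one token code per call, 0 at end of input.
   The token names and set_input() helper are illustrative only. */

enum { END = 0, NUMBER = 257, IDENT = 258 };

static const char *src;              /* input cursor */
int yylval;                          /* token attribute, as with YACC */

void set_input(const char *s) { src = s; }

int yylex(void) {
    while (*src == ' ') src++;
    if (*src == '\0') return END;
    if (isdigit((unsigned char)*src)) {
        yylval = 0;
        while (isdigit((unsigned char)*src))
            yylval = yylval * 10 + (*src++ - '0');
        return NUMBER;
    }
    if (isalpha((unsigned char)*src)) {
        while (isalpha((unsigned char)*src)) src++;
        return IDENT;
    }
    return *src++;                   /* single-character token */
}
```

As with YACC, any attribute of the token (here the value of a NUMBER) is passed on the side, in yylval, rather than in the return code.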

7 Disadvantages

What disadvantages PRECC does have relate chiefly to two things. (1) At runtime, it uses the C stack, which can grow large in deeply structured and recursive grammar descriptions, and which is a limitation (of C) under some memory models on restricted architectures, such as an 8086-based PC with its 64KByte segment limit (stack growth can be restricted by the use of `cuts' in the definition script, however). (2) In scripts, there is an as yet inadequate facility for defining local (volatile) data structures for use as attached synthetic attributes or in the actions attached to PRECC scripts; but this is also an inheritance from the compatibility with YACC. The memory required should not ordinarily be greater than about 64K bytes of stack per 80 (uncut) tokens parsed. Figures much better than this have been achieved for practical PRECC parsers [4].

It is true that an infinitely recursing parser will draw on all the memory available before halting, however, and this is a problem, because specifications that give rise to such recursions cannot in general be detected algorithmically, because of the power of PRECC's descriptive language. It is possible to specify a parser that will read a programming language script, for example, and only terminate if the parsed code will terminate correctly on execution. It is possible to restrict oneself to the LL(1) subset of grammars, for example, in order to be certain that PRECC will generate a non-looping parser. But no mechanism has been built into PRECC itself to restrict the acceptable specifications in this way, nor does it seem desirable to do so. Static analysis utilities seem much better suited to the job of advising the user on the implementational aspects of the specification. Appendix B of this paper contains an exact semantics for PRECC parsers which may be used as a basis for such tools.

8 The complete PRECC syntax

An outline of the complete PRECC language is shown in Figure 2, using an unsophisticated PRECC script that may also be suitable for bootstrapping using a simpler parser generator. White space and comments may be assumed to have been already removed by the tokenizer, and lines not beginning with the `@' symbol to have been discarded. The `C rvalue' in the description of predicate is intended to denote either a function name, or a function name followed by a bracketed list of C expressions (a function application, like foo(x,y,1+z)). PRECC will invoke the function with the current token as an extra parameter (in this example, foo(x,y,1+z,tok)) and expects a boolean value to be returned.

PRECC does perform some analysis of C expressions. Firstly, C code within actions must be searched for $n references (to synthetic attributes), and these pseudo-variables are then replaced by the appropriate C code. Secondly, and more prosaically, PRECC has to `understand' C expressions in order to determine where they end. The syntax described in the figure is highly ambiguous. The following is perfectly legal PRECC syntax (a sequence of two conditions surrounding a parser b), for example:

    @ a = )foo(TRUE)( b )(bar)(

but it is only prevented from being interpreted as two conditions surrounding a parser TRUE:

    @ a = )foo( TRUE )(b)(bar)(

by recognition of foo(TRUE) as the maximal initial segment of the parse interpretable as a C expression. It can be seen that extra care is needed when using C macros.
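The maximal-initial-segment rule can be sketched as a small scanner. This is a deliberately narrow simplification of the analysis PRECC must perform: it recognizes only an identifier followed by at most one balanced parenthesised argument list, and the function name is invented:

```c
#include <ctype.h>
#include <stddef.h>

/* Length of the longest prefix of s interpretable as a single C
   rvalue: an identifier optionally followed by one balanced
   parenthesised argument list.  (A simplified illustration of the
   C-expression analysis described above; not PRECC's own code.) */
size_t c_rvalue_len(const char *s) {
    size_t i = 0;
    int depth = 0;
    while (isalnum((unsigned char)s[i]) || s[i] == '_') i++;
    if (i == 0) return 0;                     /* must start with a name */
    if (s[i] == '(') {
        do {
            if (s[i] == '(') depth++;
            else if (s[i] == ')') depth--;
            else if (s[i] == '\0') return 0;  /* unbalanced: no rvalue */
            i++;
        } while (depth > 0);
    }
    return i;
}
```

Applied to the condition text "foo(TRUE)( b )(bar)(" from the example above, the scanner returns 9, selecting "foo(TRUE)" as the C expression, which is the reading described in the text.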

9 Summary

PRECC is a compiler-compiler based on a powerful higher-order programming model, and produces modular, highly portable and maintainable C code. All the standard BNF constructs are

    @ script = declaration?
    @ declaration = lvalue ⟨'='⟩ expression
    @ lvalue = parserid [ ⟨'('⟩ formals ⟨')'⟩ ]
    @ formals = varid { ⟨','⟩ varid }?
    @ expression = alternate { ⟨'|'⟩ alternate }?
    @ alternate = [ sequent { [ ⟨'\'⟩ varid ] sequent }? ]
    @ sequent = sequent0 [ postop ]
    @ sequent0 = ⟨'⟨'⟩ C expression ⟨'⟩'⟩    /* literal */
    @          | ⟨'⟩'⟩ C expression ⟨'⟨'⟩    /* not a literal */
    @          | ⟨')'⟩ C expression ⟨'('⟩    /* condition */
    @          | ⟨'@'⟩ C expression          /* attribute */
    @          | ⟨'('⟩ C rvalue ⟨')'⟩        /* predicate */
    @          | ⟨':'⟩ C code ⟨':'⟩          /* action */
    @          | ⟨'['⟩ expression ⟨']'⟩      /* option */
    @          | ⟨']'⟩ expression ⟨'['⟩      /* phantom */
    @          | ⟨'{'⟩ expression ⟨'}'⟩      /* bracket */
    @          | ⟨'?'⟩                       /* any token */
    @          | ⟨'$'⟩                       /* EOL */
    @          | ⟨'!'⟩                       /* cut */
    @          | rvalue                      /* atomic */
    @ postop = { ⟨'+'⟩ | ⟨'?'⟩ } [ C expression ]
    @ rvalue = parserid [ ⟨'('⟩ actuals ⟨')'⟩ ]
    @ actuals = C expression { ⟨','⟩ C expression }?

Figure 2: PRECC syntax summary.


built into the grammar definition language, and more may be defined using PRECC's own `macros'. The utility has been used to generate parsers and compilers for both commercially available (COBOL [16] and Occam [4]) and experimental (Uniform [18]) languages. A comparative study with YACC [7] and GNU's BISON clone has been undertaken using Oberon-2 [14] as the input grammar. PRECC also uses itself to parse its own input specification and compile its own code.

PRECC has been designed to lift some of the programming and implementational restrictions imposed by compiler-compilers in the style of the UNIX YACC utility. The design notably provides infinite lookahead and backtracking in place of YACC's one-token lookahead, and a declarative semantics. Scripts can contain arbitrarily complex BNF expressions, and these may take parameters. When the parameters are other grammar descriptions, the scripted clause behaves like a referentially transparent higher-order `macro', which helps achieve clearer coding and maintainability. Scripts may in any case be divided up into independent modules, and compiled separately and linked together independently in order to cater for improved change management. PRECC has a formal basis that is fully investigated in [5], for those who wish to find out more about the foundations of the tool.

References

[1] A.V. Aho and J.D. Ullman, The theory of parsing, translation, and compiling, Vols. 1 & 2, Prentice-Hall, New Jersey, USA, 1972–73.
[2] A.V. Aho and J.D. Ullman, Principles of compiler design, Addison-Wesley Publishing Company, 1977.
[3] J.P. Bowen, From programs to object code using logic and logic programming, in R. Giegerich and S.L. Graham (eds.), Code Generation – Concepts, Tools, Techniques, Proc. International Workshop on Code Generation, Dagstuhl, Germany, 20–24 May 1991, Springer-Verlag, Workshops in Computing, pp. 173–192, 1992.
[4] J.P. Bowen and P.T. Breuer, Occam's Razor: the cutting edge of parser technology, in Proc. TOULOUSE 92: Fifth International Conference on Software Engineering and its Applications, Toulouse, France, 7–11 December 1992.
[5] P.T. Breuer and J.P. Bowen, A PREttier Compiler-Compiler: Generating higher order parsers in C, Programming Research Group Technical Report PRG-TR-20-92, Oxford University Computing Laboratory, UK, 1992.
[6] J. Fairbairn, Making form follow function: An exercise in functional programming style, Software—Practice and Experience, 17(6), pp. 379–386, 1987.
[7] S.C. Johnson and M.E. Lesk, Language development tools, The Bell System Technical Journal, 57(6), part 2, pp. 2155–2175, July/August 1978.
[8] S.C. Johnson, Yacc meets C++, in UNIX around the World, Proc. Spring 1988 EUUG Conference, pp. 53–57, 1988.
[9] B.W. Kernighan and D.M. Ritchie, The C programming language, 2nd edition, Prentice-Hall Software Series, 1988.
[10] D.E. Knuth, Literate programming, The Computer Journal, 27(2), pp. 97–111, May 1984.
[11] B.B. Kristensen and O.L. Madsen, Methods for computing LALR(k) lookahead, ACM TOPLAS, 3(1), pp. 60–82, 1981.
[12] J. Lewi, K. de Vlaminck, I. van Horebeek and E. Steegmans, Software development by LL(1) syntax description, John Wiley & Sons Ltd, 1992.
[13] G.H. Merrill, Parsing non-LR(k) grammars with yacc, SAS Institute Inc., SAS Campus Drive, Cary, NC 27513, USA, 1992. Submitted to Software—Practice and Experience.
[14] H. Mössenböck and N. Wirth, The programming language Oberon-2, Institut für Computersysteme, ETH Zürich, Switzerland, January 1992.
[15] J.C.H. Park and K.M. Choe, Remarks on recent algorithms for LALR lookahead sets, ACM SIGPLAN Notices, 22(4), pp. 30–32, April 1987.
[16] A. Parkin, COBOL for students, Edward Arnold, London, 1984.
[17] S. Sippu and E. Soisalon-Soininen, Parsing theory, Vol. 15, EATCS Monographs on Theoretical Computer Science, Springer-Verlag, 1988.
[18] C. Stanley-Smith and A. Cahill, UNIFORM: A language geared to system description and transformation, University of Limerick, Ireland, 1990.
[19] B. Stroustrup, The C++ programming language, 2nd edition, Addison-Wesley, 1991.

A FTP access

PRECC is available via anonymous FTP from the Internet host ftp.comlab.ox.ac.uk (192.76.25.2). The relevant files are:

/pub/Programs/preccx.tar.Z

(in compressed "tar" form) for UNIX systems, and, under the directory

/pub/Programs/preccx.msdos

for use on PC-compatible computers under DOS. The latter directory contains scripts for several different languages. There is a nominal charge for supply on a floppy disk instead.

B A trace/refusal semantics

We can define a trace/refusal semantics for PRECC parsers, and this provides the language with which to discuss the behaviour of parsers. The following is a simplification of the system given in [5]. Accordingly, let a trace (a, b) of a parser p consist of the initial part a of the input token stream absorbed by p (a finite sequence of tokens), and the remaining part b released again to the output stream on success (signalled by a result OK r), and let Tr(p) be the set of these traces:

    Tr(p) = { (a, b) | p(a⌢b) = (b, OK r) }

In the above, parsers are represented as functions acting explicitly on an input stream to produce an output stream and a result. Similarly, we define a refusal for a parser p to be a (possibly incomplete) sequence of input tokens a which causes an error signal Err e. The input stream is automatically rewound afterwards:

    Re(p) = { a | p(a) = (a, Err e) }

The set of traces and the set of refusals together determine the parser's behaviour uniquely. Note that it is perfectly feasible for a parser to behave according to the input tokens still to come; in practice, the parser merely has to `look ahead' a little. The following deductions are sound for the auto-rewinding parsers defined by PRECC specifications. The concatenation of sequences of tokens is written `a⌢b', the empty sequence of tokens is `[ ]', and a singleton sequence of tokens is `[a]'. The unspecified parser (with no known traces or refusals) is written `⊥':


    (a, b⌢c) ∈ Tr(p)    (b, c) ∈ Tr(q)          (a, b) ∈ Tr(p)
    ----------------------------------          ----------------
           (a⌢b, c) ∈ Tr(p q)                   (a, b) ∈ Tr(p|q)

    a⌢b ∈ Re(p)    (a, b) ∈ Tr(q)               a ∈ Re(p)
    ------------------------------              -----------
          (a, b) ∈ Tr(p|q)                      a ∈ Re(p q)

    (a, b) ∈ Tr(p)    b ∈ Re(q)                 a ∈ Re(p)    a ∈ Re(q)
    ----------------------------                ----------------------
          a⌢b ∈ Re(p q)                              a ∈ Re(p|q)

    (a, b) ∈ Tr(F^n(⊥))          a ∈ Re(F^n(⊥))
    ----------------------       -----------------       ------------------
    (a, b) ∈ Tr(p = F(p))        a ∈ Re(p = F(p))        a ∈ Re(/*empty*/)

    ---------------------        -------------------
    ([ ], b) ∈ Tr(@ x)           ([a], b) ∈ Tr(⟨a⟩)

         a′ ≠ a
    ---------------------        ----------------
    [a′]⌢b ∈ Re(⟨a⟩)             [ ] ∈ Re(⟨a⟩)

         v                            ¬v
    ---------------------        ----------------
    ([ ], b) ∈ Tr( )v( )         a ∈ Re( )v( )

    (a, b) ∈ Tr(p)               a ∈ Re(p)
    ------------------------     --------------
    ([ ], a⌢b) ∈ Tr( ]p[ )       a ∈ Re( ]p[ )

This abstraction ignores the synthetic attribute calculus that PRECC implements. For example, according to this semantics the insertions `@1' and `@2' have the same meaning, whilst in fact they attach different values as attributes to the parsed terminal or non-terminal. Note that the trace/refusal axiomatic semantics here provides a suitable basis for an implementation of PRECC as a logic program.
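The sequencing and alternation rules above can be modelled directly by a small executable sketch (an illustration of the semantics only, not PRECC code; all the names here are invented): a parser maps an input stream either to OK with the unconsumed remainder, giving a trace, or to Err with the stream rewound to where it started, giving a refusal.

```c
/* Toy model of the trace/refusal semantics: parsers as C
 * functions from a stream to a (rest, ok) pair, where failure
 * always rewinds the stream to the parser's starting point.
 */
typedef struct { const char *rest; int ok; } Result;
typedef Result (*Parser)(const char *);

static Result ok(const char *rest) { Result r = { rest, 1 }; return r; }
static Result err(const char *in)  { Result r = { in,   0 }; return r; }

/* literal parsers <a> and <b>: Tr(<a>) = { ([a], b) } */
static Result lit_a(const char *in) { return in[0] == 'a' ? ok(in + 1) : err(in); }
static Result lit_b(const char *in) { return in[0] == 'b' ? ok(in + 1) : err(in); }

/* p q : run q on what p left; a refusal rewinds past p too,
 * matching  (a,b) ∈ Tr(p), b ∈ Re(q)  =>  a⌢b ∈ Re(p q)  */
static Result seq(Parser p, Parser q, const char *in)
{
    Result rp = p(in);
    if (!rp.ok) return err(in);
    Result rq = q(rp.rest);
    return rq.ok ? rq : err(in);
}

/* p | q : the traces of p, plus the traces of q where p refuses */
static Result alt(Parser p, Parser q, const char *in)
{
    Result rp = p(in);
    return rp.ok ? rp : q(in);
}
```

For example, seq(lit_a, lit_b, "ac") fails, and the stream it returns is rewound to the whole input "ac", exactly as the refusal rule for sequencing requires; alt(lit_b, lit_a, "ab") succeeds via its second branch because lit_b refuses the input and leaves it rewound for lit_a.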
