Syntax and Semantics

COS 301 Programming Languages Sebesta Chapter 3.1 – 3.4 Syntax and Semantics

Introduction
• Syntax: the form or structure of the expressions, statements, and program units
• Semantics: the meaning of the expressions, statements, and program units
• Syntax and semantics provide a language’s definition

"A language that is simple to parse for the compiler is also simple to parse for the human programmer." (N. Wirth)

Chapter 3 Topics
• Introduction
• The General Problem of Describing Syntax
• Formal Methods of Describing Syntax
• Attribute Grammars
• Dynamic Semantics

Describing Syntax
• Descriptions of syntax are intended to communicate facts about a language to an audience. Who?
– Programmers want to find out what legal programs look like
– Implementers want an exact, detailed definition
– Tools such as parser and scanner generators need an exact, detailed definition in a particular, machine-readable form
– Tools often need ambiguity eliminated, while people often prefer a more readable grammar

Some Terminology

• Any language (human or computer or otherwise) consists of a set of strings called sentences
• The syntactic rules of a language specify which strings are legal members of the language
– This does not of course preclude a language from having an infinite number of such strings
• Human languages are quite complex compared to computer languages
• A sentence is a string of characters over some alphabet
– An alphabet can be as small as two symbols, e.g. {0,1}
• A language is a set of sentences
• A lexeme is the lowest-level syntactic unit of a language (e.g., 1.0, *, sum, begin)
– Formal syntactic descriptions of a language are usually separated into lexical and syntactic rules.
– Lexical rules specify how numeric literals, language operators, keywords, etc. are formed
• A token type is a category of lexemes (e.g., identifier)

Tokens and Lexemes

• Lexemes are partitioned into groups or types such as identifiers, operators, integer literals, etc.
• Often the term “token” is used in place of lexeme

Index = 2 * count + 17;

Lexeme    Token          Value
Index     identifier     "index"
=         assignment
2         int literal    2
count     identifier     "count"
+         addition
17        int literal    17

Lexical and Syntactic Rules

• Lexical and syntactic rules are specified separately because they are specified by different types of grammars and are recognized by different types of automata
• In particular, lexical rules are equivalent to regular expressions and are specified by very restricted grammars called regular grammars
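Because lexical rules are equivalent to regular expressions, a scanner can be sketched directly with a regex library. The following is a minimal illustration only: the token names and the patterns are assumptions chosen to mirror the token table, not the definition of any particular language.

```python
import re

# Hypothetical token types and patterns (assumptions for this sketch).
TOKEN_SPEC = [
    ("int_literal",    r"\d+"),
    ("identifier",     r"[A-Za-z_]\w*"),
    ("assignment",     r"="),
    ("addition",       r"\+"),
    ("multiplication", r"\*"),
    ("semicolon",      r";"),
    ("skip",           r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token_type, lexeme) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        if match.lastgroup != "skip":          # whitespace separates lexemes but is not a token
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for token_type, lexeme in tokenize("index = 2 * count + 17;"):
        print(f"{lexeme:8} {token_type}")
```

Each named group plays the role of one lexical rule; the scanner only classifies lexemes and leaves the phrase structure to the parser.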

Formal Definition of Languages

• Languages can be formally defined in two different ways:
1. Recognizers
2. Generators

• Recognizers
– A recognition device reads input strings over the alphabet of the language and decides whether the input strings belong to the language
– Example: syntax analysis part of a compiler

• Generators
– A device that generates sentences of a language
– One can determine whether a particular sentence is syntactically correct by comparing it to the structure of the generator
– Example: a grammar

Recognizers and Generators

• There is a close relationship between recognizers and generators of a language
• Given a context-free grammar (a generator) we can algorithmically construct a recognizer (a parser)
• Many such systems have been constructed
• The oldest (and still in wide use) is yacc
– Yet Another Compiler Compiler

The Chomsky Hierarchy

• Noam Chomsky developed the idea of formal grammars in the late 1950s
• Four levels of grammar:
1. Regular
2. Context-free
3. Context-sensitive
4. Unrestricted (recursively enumerable)
• We will use only regular and context-free grammars
• Chomsky’s work has been extended into a set of 9 levels distinguished by recognition automata

BNF and Context-Free Grammars

• Context-Free Grammars
– Grammars are language generators, meant to describe the syntax of natural languages
– Context-free grammars define a class of languages called context-free languages
– CFGs are the most powerful grammars that are amenable to computation
• Backus-Naur Form (1959)
– Invented by John Backus and Peter Naur to describe Algol 60
– BNF grammars are equivalent to context-free grammars
– A similar notation was actually used over 2,000 years ago to describe the structure of Sanskrit (one of the most regular of human languages)

BNF Grammar: Formalism

• The grammar of a programming language is a set {P,T,N,S} with four members
1. A set of productions: P
2. A set of terminal symbols: T
3. A set of nonterminal symbols: N
4. A start symbol: S ∈ N
• A production has the form A → ω where A ∈ N and ω ∈ (N ∪ T)*
• Note that N and T are disjoint sets

BNF Fundamentals

• BNF is a metalanguage (a language used to describe another language)
• BNF uses abstractions to represent classes of syntactic structures
• A simple assignment statement might be represented by the symbol <assign>

<assign> → <var> = <expression>

• A production or rule shows how a nonterminal can be expanded
• A rule has a left-hand side (LHS), which is a nonterminal, and a right-hand side (RHS), which is a string of terminals and/or nonterminals
• Terminals cannot be expanded further

BNF Notation

• Nonterminals are often enclosed in angle brackets
– Examples of BNF rules:
<ident_list> → identifier | identifier, <ident_list>
<if_stmt> → if <logic_expr> then <stmt>

BNF Rules or Productions

• A production is a rule for rewriting that can be applied to a string of symbols called a sentential form
– The nonterminal symbols N identify grammatical categories such as identifier, integer, expression, program
– The start symbol S identifies the principal grammatical category (usually Program).
– The terminal symbols T are the lexemes or tokens from which programs are constructed
• An abstraction (or nonterminal symbol) can have more than one RHS
<stmt> → <single_stmt> | begin <stmt_list> end

Describing Lists

• Syntactic lists are described using recursion
<ident_list> → ident | ident, <ident_list>
• A derivation is a repeated application of rules, starting with the start symbol and ending with a sentence (all terminal symbols)

Metasymbols

• The symbol → is used after the left nonterminal of a rule. Alternate (original) notation uses ::=
• Nonterminals may be written in angle brackets or with a distinctive font
• Selection (OR) is designated by the | character.
• Parentheses may be used for grouping ( ... ).
• Note that there are several different written styles for BNF but all are fundamentally equivalent

Definition: A Language

• The language L defined by a BNF grammar G = {P,T,N,S} is the set of all terminal strings that can be derived from the start symbol in zero or more steps.

An Example Grammar

<program> → <stmts>
<stmts> → <stmt> | <stmt> ; <stmts>
<stmt> → <var> = <expr>
<var> → a | b | c | d
<expr> → <term> + <term> | <term> - <term>
<term> → <var> | const

An Example Derivation

<program> => <stmts>
          => <stmt>
          => <var> = <expr>
          => a = <expr>
          => a = <term> + <term>
          => a = <var> + <term>
          => a = b + <term>
          => a = b + const

Derivations

• Every string of symbols in a derivation is a sentential form
• A sentence is a sentential form that has only terminal symbols
• A leftmost derivation is one in which the leftmost nonterminal in each sentential form is the one that is expanded
• A derivation may be neither leftmost nor rightmost

Example

• Given G below, does the string cbab belong to L(G)? In other words, is there a way to derive cbab from the start symbol?
• G = { T, V, P, S }
T = { a, b, c }
V = { A, B, C, W }
S = { W }
• P consists of the rules:

1. W → AB      or   <W> ::= <A><B>
2. A → Ca           <A> ::= <C>a
3. B → Ba           <B> ::= <B>a
4. B → Cb           <B> ::= <C>b
5. B → b            <B> ::= b
6. C → cb           <C> ::= cb
7. C → b            <C> ::= b

Leftmost derivation

• Begin with the start symbol W and apply production rules expanding the leftmost nonterminal.

W => AB       Rule 1
  => CaB      Rule 2
  => cbaB     Rule 6
  => cbab     Rule 5

Rightmost derivation

• Begin with the start symbol W and apply production rules expanding the rightmost nonterminal.

W => AB       Rule 1
  => Ab       Rule 5
  => Cab      Rule 2
  => cbab     Rule 6
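The two derivations of cbab can be replayed mechanically. The sketch below assumes the seven numbered rules of G and treats uppercase letters as nonterminals; the function name `derive` and its parameters are illustrative choices, not part of the original slides.

```python
# Productions of G, numbered as in the example (uppercase = nonterminal).
RULES = {
    1: ("W", "AB"),
    2: ("A", "Ca"),
    3: ("B", "Ba"),
    4: ("B", "Cb"),
    5: ("B", "b"),
    6: ("C", "cb"),
    7: ("C", "b"),
}


def derive(start, rule_numbers, leftmost=True):
    """Apply the numbered rules in order, always expanding the leftmost
    (or rightmost) nonterminal, and return every sentential form."""
    form = start
    forms = [form]
    for number in rule_numbers:
        lhs, rhs = RULES[number]
        positions = [i for i, symbol in enumerate(form) if symbol.isupper()]
        if not positions:
            raise ValueError("no nonterminal left to expand")
        i = positions[0] if leftmost else positions[-1]
        if form[i] != lhs:
            raise ValueError(f"rule {number} does not apply to {form[i]} in {form}")
        form = form[:i] + rhs + form[i + 1:]
        forms.append(form)
    return forms


if __name__ == "__main__":
    print(" => ".join(derive("W", [1, 2, 6, 5])))                   # leftmost
    print(" => ".join(derive("W", [1, 5, 2, 6], leftmost=False)))   # rightmost
```

Both rule sequences end in the same sentence, cbab, so the string belongs to L(G); only the order in which nonterminals are expanded differs.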

A shorter version of G

• Using selection in the RHS
G = { T, V, P, S }
T = { a, b, c }
V = { A, B, C, W }
S = { W }

1. W → AB             or   <W> ::= <A><B>
2. A → Ca                  <A> ::= <C>a
3. B → Ba | Cb | b         <B> ::= <B>a | <C>b | b
4. C → cb | b              <C> ::= cb | b

Parse Tree

• A tree representation of a derivation

[Figure: parse tree for a = b + const]

Parse trees
• In a parse tree:
– Each internal node of the tree corresponds to a step in the derivation.
– Each child of a node represents a right-hand side of a production.
– Each leaf node represents a symbol of the derived string, reading from left to right.

A Grammar for Assignment Statements

<assign> ::= <id> = <expr>
<id> ::= A | B | C
<expr> ::= <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>

Example derivation

• A = B * ( A + C )

<assign> => <id> = <expr>
         => A = <expr>
         => A = <id> * <expr>
         => A = B * <expr>
         => A = B * ( <expr> )
         => A = B * ( <id> + <expr> )
         => A = B * ( A + <expr> )
         => A = B * ( A + <id> )
         => A = B * ( A + C )

Ambiguity in Grammars

• A grammar is ambiguous when it generates a sentential form that has two or more distinct parse trees

An ambiguous grammar

• Simple assignment statements
<assign> ::= <id> = <expr>
<id> ::= A | B | C
<expr> ::= <expr> + <expr> | <expr> * <expr> | ( <expr> ) | <id>

A small difference

• Ambiguous
<assign> ::= <id> = <expr>
<id> ::= A | B | C
<expr> ::= <expr> + <expr> | <expr> * <expr> | ( <expr> ) | <id>
• Not ambiguous
<assign> ::= <id> = <expr>
<id> ::= A | B | C
<expr> ::= <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>

What causes ambiguity?

• In the example above the unambiguous grammar allows the expression to grow only on the right
• Ambiguity is actually undecidable but there are some useful indicators such as the presence of more than one leftmost or rightmost derivation
• Parsers can use extra-grammatical information to correct ambiguity
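For one fixed sentence, ambiguity can be checked by counting parse trees. The sketch below hard-codes the two <expr> grammars compared above (an assumption of this example; it is not a general ambiguity test, which cannot exist because the problem is undecidable) and counts the trees for a token string.

```python
from functools import lru_cache

IDS = {"A", "B", "C"}
OPS = {"+", "*"}


def count_trees(tokens, ambiguous):
    """Count the parse trees that derive `tokens` from <expr>.

    ambiguous=True  uses <expr> ::= <expr> op <expr> | ( <expr> ) | <id>
    ambiguous=False uses <expr> ::= <id> op <expr>   | ( <expr> ) | <id>
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of trees for the slice tokens[i:j].
        if j - i < 1:
            return 0
        total = 0
        if j - i == 1 and tokens[i] in IDS:                      # <expr> ::= <id>
            total += 1
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":
            total += count(i + 1, j - 1)                         # <expr> ::= ( <expr> )
        if ambiguous:
            # Any operator inside the slice may be the root of the tree.
            for k in range(i + 1, j - 1):
                if tokens[k] in OPS:
                    total += count(i, k) * count(k + 1, j)
        elif j - i >= 3 and tokens[i] in IDS and tokens[i + 1] in OPS:
            total += count(i + 2, j)                             # <expr> ::= <id> op <expr>
        return total

    return count(0, len(tokens))


if __name__ == "__main__":
    print(count_trees("A+B*C", ambiguous=True))    # more than one tree
    print(count_trees("A+B*C", ambiguous=False))   # exactly one tree
```

A count above 1 for any sentence shows that the grammar is ambiguous; a count of 1 for one sentence says nothing about other sentences.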

An Unambiguous Expression Grammar

• If we use the parse tree to indicate precedence levels of the operators, we cannot have ambiguity

<expr> → <expr> - <term> | <term>
<term> → <term> / const | const

[Figure: parse tree for const - const / const]

Precedence of Operators

• Operator a has higher precedence than operator b if operator a should be evaluated before operator b in all parenthesis-free expressions involving only the two operators
– Ex: 5 * 4 + 3 = 23
      5 + 4 * 3 = 17
• “Evaluated before” means lower in the parse tree

No Precedence (right to left evaluation)
• In this grammar any parse tree with multiple operators has the rightmost operator lowest in the tree
<assign> ::= <id> = <expr>
<id> ::= A | B | C
<expr> ::= <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>
• In A + B * C multiplication will be first
• In A * B + C addition will be first

Precedence of C++ Operators

Operator    Description                                       Example

Precedence 1 (associativity: none)
::          Scoping operator                                  Class::age = 2;

Precedence 2 (left to right)
()          Grouping operator                                 (a + b) / 4;
[]          Array access                                      array[4] = 2;
->          Member access from a pointer                      ptr->age = 34;
.           Member access from an object                      obj.age = 34;
++          Post-increment                                    for( i = 0; i < 10; i++ ) ...
--          Post-decrement                                    for( i = 10; i > 0; i-- ) ...

Precedence 3 (right to left)
!           Logical negation                                  if( !done ) ...
~           Bitwise complement                                flags = ~flags;
++          Pre-increment                                     for( i = 0; i < 10; ++i ) ...
--          Pre-decrement                                     for( i = 10; i > 0; --i ) ...
-           Unary minus                                       int i = -1;
+           Unary plus                                        int i = +1;
*           Dereference                                       data = *ptr;
&           Address of                                        address = &obj;
(type)      Cast to a given type                              int i = (int) floatNum;
sizeof      Return size in bytes                              int size = sizeof(floatNum);

Precedence 4 (left to right)
->*         Member pointer selector                           ptr->*var = 24;
.*          Member object selector                            obj.*var = 24;

Precedence 5 (left to right)
*           Multiplication                                    int i = 2 * 4;
/           Division                                          float f = 10 / 3;
%           Modulus                                           int rem = 4 % 3;

Precedence 6 (left to right)
+           Addition                                          int i = 2 + 3;
-           Subtraction                                       int i = 5 - 1;

Precedence 7 (left to right)
<<          Bitwise shift left                                int flags = 33 << 1;
>>          Bitwise shift right                               int flags = 33 >> 1;

Precedence 8 (left to right)
<           Comparison less-than                              if( i < 42 ) ...
<=          Comparison less-than-or-equal-to                  if( i <= 42 ) ...
>           Comparison greater-than                           if( i > 42 ) ...
>=          Comparison greater-than-or-equal-to               if( i >= 42 ) ...

Precedence 9 (left to right)
==          Comparison equal-to                               if( i == 42 ) ...
!=          Comparison not-equal-to                           if( i != 42 ) ...

Precedence 10 (left to right)
&           Bitwise AND                                       flags = flags & 42;

Precedence 11 (left to right)
^           Bitwise exclusive OR                              flags = flags ^ 42;

Precedence 12 (left to right)
|           Bitwise inclusive (normal) OR                     flags = flags | 42;

Precedence 13 (left to right)
&&          Logical AND                                       if( conditionA && conditionB ) ...

Precedence 14 (left to right)
||          Logical OR                                        if( conditionA || conditionB ) ...

Precedence 15 (right to left)
? :         Ternary conditional (if-then-else)                int i = (a > b) ? a : b;

Precedence 16 (right to left)
=           Assignment operator                               int a = b;
+=          Increment and assign                              a += 3;
-=          Decrement and assign                              b -= 4;
*=          Multiply and assign                               a *= 5;
/=          Divide and assign                                 a /= 2;
%=          Modulo and assign                                 a %= 3;
&=          Bitwise AND and assign                            flags &= new_flags;
^=          Bitwise exclusive OR and assign                   flags ^= new_flags;
|=          Bitwise inclusive (normal) OR and assign          flags |= new_flags;
<<=         Bitwise shift left and assign                     flags <<= 2;
>>=         Bitwise shift right and assign                    flags >>= 2;

Precedence 17 (left to right)
,           Sequential evaluation operator                    for( i = 0, j = 0; i < 10; i++, j++ ) ...

Associativity

• Associativity specifies whether operators of equal precedence should be evaluated left-to-right or right-to-left
– Ex (left): 5 - 4 - 3 = 1 - 3 = -2
– Ex (right): 2 ** 3 ** 2 = 2 ** 9 = 512

• A grammar can be used to define both associativity and precedence among the operators in an expression. Consider the conventional rules:
– + and - are left-associative operators
– *, %, and / are left-associative but have higher precedence than + and -
– Exponentiation (**) is right-associative and has the highest precedence

• Consider this grammar G:
Expr -> Expr + Term | Expr - Term | Term
Term -> Term * Factor | Term / Factor | Term % Factor | Factor
Factor -> Primary ** Factor | Primary
Primary -> 0 | ... | 9 | ( Expr )

[Figure: parse tree for 4**2**3+5*6+7]

Determining precedence and associativity

• Precedence is determined by the length of the shortest derivation from start symbol to operator
– Shorter derivations have lower precedence
• Associativity is determined by use of left or right recursion
– Left: Expr -> Expr + Term | Expr - Term | Term
– Right: Factor -> Primary ** Factor | Primary
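To see the precedence and associativity encoded by grammar G at work, here is a small recursive-descent evaluator sketched under a few assumptions: the left-recursive rules for Expr and Term are implemented as loops (a recursive-descent parser cannot use left recursion directly), `/` is taken to be integer division, and the function name `evaluate` is illustrative.

```python
import re


def evaluate(source):
    """Evaluate an expression of grammar G (single-digit operands only)."""
    # Characters outside the grammar's alphabet are silently skipped by findall.
    tokens = re.findall(r"\*\*|[-+*/%()]|\d", source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def expr():
        # Expr -> Expr + Term | Expr - Term | Term  (loop folds to the left)
        value = term()
        while peek() in ("+", "-"):
            op = advance()
            right = term()
            value = value + right if op == "+" else value - right
        return value

    def term():
        # Term -> Term * Factor | Term / Factor | Term % Factor | Factor
        value = factor()
        while peek() in ("*", "/", "%"):
            op = advance()
            right = factor()
            if op == "*":
                value *= right
            elif op == "/":
                value //= right      # assumption: integer division
            else:
                value %= right
        return value

    def factor():
        # Factor -> Primary ** Factor | Primary  (right recursion: right-associative)
        base = primary()
        if peek() == "**":
            advance()
            return base ** factor()
        return base

    def primary():
        # Primary -> 0 | ... | 9 | ( Expr )
        token = peek()
        if token == "(":
            advance()
            value = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return value
        if token is not None and token.isdigit():
            advance()
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")

    result = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result


if __name__ == "__main__":
    print(evaluate("5-4-3"))           # left-associative subtraction
    print(evaluate("2**3**2"))         # right-associative exponentiation
    print(evaluate("4**2**3+5*6+7"))
```

The nesting of the functions mirrors the grammar: the deeper a nonterminal sits below Expr, the higher the precedence of its operators.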

Some design choices

• C++ has 17 distinct levels of precedence, Java has 16, C has 15
– In all three languages some operators associate to the left and others to the right
  a = b < c ? * p + b * c : 1

Some design choices - 2

• Smalltalk has no precedence and all operators are left-associative

{ Statements }
Statements -> Statements | Statement

Example

• With which ‘if’ does the following ‘else’ associate?
if (x < 0) if (y < 0) y = y - 1; else y = 0;
• Answer: either one!

Parse Trees

[Figure: parse trees for the statement above]

Solving the dangling else problem

1. Algol 60, C, C++, Pascal: associate each else with closest if; use { } or begin…end to override.
2. Algol 68, Modula, Ada, Visual Basic: use explicit delimiter to end every conditional (e.g., if…fi)
3. Java: rewrite the grammar to limit what can appear in a conditional:

IfThenStatement -> if ( Expression ) Statement
IfThenElseStatement -> if ( Expression ) StatementNoShortIf else Statement

The category StatementNoShortIf includes all except IfThenStatement.

Audiences

• Grammars are means of communicating information to an audience
– Programmers want to find out what legal programs look like
– Implementers want an exact, detailed definition
– Tools such as parser and scanner generators need an exact, detailed definition in a particular, machine-readable form
– Tools often need ambiguity eliminated, while people often prefer a more readable grammar
• Grammars therefore can vary with the audience

Levels of Precedence and Complexity

• C, C++, and Java have a large number of operators and precedence levels
• For each precedence level we need to introduce a new nonterminal
• Grammar can get large and difficult to read
• Instead of using a large grammar, we can:
– Write a smaller ambiguous grammar, and
– Specify precedence and associativity separately

Extended BNF

• BNF was developed in the late 1950s; still very widely used
• However the original BNF has a few minor inconveniences such as recursion instead of iteration and verbose selection syntax
– Note that for some applications such as recursive descent parsing, left recursion is forbidden
• Extended BNF (EBNF) increases readability and writability
– Expressive power is unchanged: still CFGs
• Several variations exist

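Extended BNF: Optional parts

• Optional parts are placed in brackets [ ]
<proc_call> → ident ([<expr_list>])
• Replaces
<proc_call> → ident()
<proc_call> → ident (<expr_list>)

Extended BNF: Alternative RHS

• Alternative parts of RHSs are placed inside parentheses and separated via vertical bars
<term> → <term> (+|-) const
• Replaces
<term> → <term> + const | <term> - const

Extended BNF: Recursion

• Repetitions (0 or more) are placed inside braces { }
<ident> → letter {letter|digit}
• Replaces
<ident> → letter | <ident> letter | <ident> digit

BNF and EBNF

• BNF
<expr> → <expr> + <term> | <expr> - <term> | <term>
<term> → <term> * <factor> | <term> / <factor> | <factor>
• EBNF
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}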

Recent Variations in EBNF

• Alternative RHSs are put on separate lines
• Use of a colon instead of =>
• Use of opt for optional parts
• Use of oneof for choices

EBNF and Associativity

• Note that the production:
Expr -> Term { ( + | - ) Term }
does not seem to specify the left associativity that we have in
Expr -> Expr + Term | Expr - Term | Term
• In EBNF left recursion is usually assumed.
– Explicit recursion is used for right associative operators
– Some EBNF grammars may specify associativity outside of the grammar
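One common way to read the EBNF production Expr -> Term { ( + | - ) Term } as left-associative is to turn the braces into a loop that folds each new operand into the tree built so far. The sketch below assumes Term is simplified to a single integer token, and the nested-tuple tree format is an illustrative choice.

```python
def parse_expr(tokens):
    """Parse Expr -> Term { ( + | - ) Term } into a nested tuple tree.

    Term is simplified to one integer token (an assumption of this sketch).
    """
    if not tokens:
        raise SyntaxError("empty input")
    tree = int(tokens[0])                       # first Term
    i = 1
    while i < len(tokens):                      # zero or more repetitions of (op, Term)
        op = tokens[i]
        if op not in ("+", "-") or i + 1 >= len(tokens):
            raise SyntaxError(f"unexpected token {op!r}")
        # The tree built so far becomes the LEFT operand: left associativity.
        tree = (op, tree, int(tokens[i + 1]))
        i += 2
    return tree


if __name__ == "__main__":
    print(parse_expr(["5", "-", "4", "-", "3"]))
```

The resulting tree groups 5 - 4 first, matching the left-recursive BNF rule even though the EBNF rule itself contains no recursion.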

EBNF to BNF

• We can always rewrite an EBNF grammar as a BNF grammar. E.g.,
A -> x { y } z
• can be rewritten:
A -> x A' z
A' -> ε | y A'
• Note ε is the standard symbol used in grammars for the “empty string”
• Rewriting EBNF rules with ( ), [ ] can be done in a similar fashion
• While EBNF is no more powerful than BNF, its rules are often simpler and clearer for human readers

Syntax Diagrams

• Similar to EBNF
• Introduced by Jensen and Wirth with Pascal in 1975

[Figure: Ex: Expressions with addition]

[Figure: A More Complex Example: An Expression Grammar. From http://en.wikipedia.org/wiki/Syntax_diagram]

Static Semantics

• Context-free grammars (CFGs) cannot describe all of the syntax of programming languages
• “Static semantics” has only an indirect relationship with meaning
– Static semantic rules deal with the legal form of programs (syntax)
– Most rules deal with typing systems
– Called “static” because analysis is done at compile time
• Dynamic semantics describe the meaning (runtime behavior) of a program

Static Semantics

• Typical items for static semantic analysis:
– Type of RHS of an expression must match the type of the LHS (lvalue)
– All variables must be declared before being referenced
• First restriction can be expressed in BNF but only in a very cumbersome way
• The second restriction cannot be expressed in BNF
– Consider the common-sense meaning of “context free” and “context sensitive”

Attribute Grammars

• Attribute Grammars (AGs) were developed by Donald Knuth in 1968
• AGs are additions to CFGs to carry some semantic information on parse tree nodes
• CFG plus
– Attributes
  • Associated with terminals and non-terminals, similar to variables in that values can be assigned
– Attribute computation functions
  • Aka semantic functions, associated with grammatical rules, specify how attribute values are computed
– Predicate functions
  • State semantic rules; associated with grammatical rules

Attribute Grammars: Definition

• Def: An attribute grammar is a context-free grammar G = {P,T,N,S} with the following additions:
– For each grammar symbol x in N,T there is a set A(x) of attribute values
  • A(x) consists of two disjoint sets S(x) and I(x) called synthesized attributes and inherited attributes
– Each rule has a set of functions that define certain attributes of the nonterminals in the rule
– Each rule has a (possibly empty) set of predicates to check for attribute consistency

Synthesized Attributes

• Synthesized attributes are used to pass semantic information UP the parse tree
– Synthesized = computed
– For a grammar rule of the form X0 -> X1 . . . Xn the synthesized attributes of X0 are computed as a function f(A(X1) . . . A(Xn))
– The value of a synthesized attribute therefore depends only on the values of the attributes of that node's children

Inherited Attributes

• Inherited attributes are used to pass semantic information DOWN the parse tree
– Child nodes inherit from the parent
– For a grammar rule of the form X0 -> X1 . . . Xn the inherited attributes of Xj are computed as a function f(A(X0) . . . A(Xj-1))
– The value of an inherited attribute therefore depends only on the values of the attributes of the parent and (usually) the left siblings

Predicate Functions

• A predicate is a Boolean expression on the union of the attribute set {A(X0) . . . A(Xn)} and a set of literal values
• The only derivations allowed in an attribute grammar are those in which every predicate associated with a nonterminal is true.
• A false predicate indicates a rule violation

Attributed (decorated) parse trees

• The parse tree has a possibly empty set of attributes attached to each node.
• When all attributes have been computed the tree is fully attributed or decorated
• Conceptually, the parse produces a parse tree, then attribute values are computed in a second pass

Intrinsic Attributes

• Intrinsic attributes are synthesized attributes whose values are determined outside the parse tree
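A rough sketch of how the three kinds of attributes and a predicate might fit together, for the static-semantic rule that the type of the RHS must match the type of the LHS. Everything here is an assumption made for illustration: the rules <assign> -> <var> = <expr> and <expr> -> <var> + <var>, the attribute names `expected_type` and `actual_type`, and the symbol table contents.

```python
# Hypothetical declarations; in a compiler this would be the symbol table.
SYMBOL_TABLE = {"A": "int", "B": "int", "C": "real"}


def var_type(name):
    """Intrinsic attribute of <var>: determined outside the parse tree."""
    return SYMBOL_TABLE[name]


def expr_actual_type(left, right):
    """Synthesized attribute of <expr> -> <var> + <var>:
    computed from the attributes of the children."""
    if var_type(left) == "int" and var_type(right) == "int":
        return "int"
    return "real"


def check_assign(target, left, right):
    """<assign> -> <var> = <expr>

    expected_type is passed DOWN to <expr> from the LHS variable (inherited);
    actual_type is passed UP from the operands (synthesized);
    the predicate requires the two to agree.
    """
    expected_type = var_type(target)
    actual_type = expr_actual_type(left, right)
    return actual_type == expected_type


if __name__ == "__main__":
    print(check_assign("A", "A", "B"))   # int = int + int
    print(check_assign("A", "A", "C"))   # int = int + real violates the predicate
```

A false result corresponds to a false predicate, that is, a static-semantic rule violation detected without running the program.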

Attribute Grammars: Definition

Attribute Grammars: An Example

• Let X0 -> X1 ... Xn be a rule
• Functions of the form S(X0) = f(A(X1), ... , A(Xn)) define synthesized attributes
• Functions of the form I(Xj) = f(A(X0), ... , A(Xn)), for i