Week 2

Week 2 #

Programming Languages define syntax formally.

  • Lexical Rules: specify the form of the ‘building blocks’ of a PL

    • i.e. Comment Syntax, Tokens (keywords, literals, operators, etc.) and delimiters, White space
  • Syntax: specifies how the ‘building blocks’ are put together

Regular Expressions #

Notation #

  • Kleene Star (*): 0 or more repetitions

  • Alternation (+ or |): denotes choice

  • Grouping: marked by parentheses ()

  • Empty/Null String: denoted by epsilon: $\epsilon$

  • Empty Language: language with no strings, denoted by $\empty$

Examples:

(0+1)*			0 or more of ("0" OR "1") 

1*(:+;)*		0+ of "1", 0+ of (":" OR ";")

(a+b)*aa(a+b)*  0+ of ("a" OR "b"), "aa", 0+ of ("a" OR "b")

Formal Definition #

Let $\Sigma$ denote a finite alphabet. A given string is a Regular Expression (RE) iff it can be formed from primitive REs $$ [\empty, \epsilon, \textrm{ any } a | a \in \Sigma ] $$

Rules of Applications of Regular Expressions #

$a$ and $b$ are regular expressions

  • $a + b$: Union

  • $ab$ : Concatenation

  • $a^\star$: Kleene Closure

  • $(a)$: Grouping

  • $\Sigma^\star$: the set of all finite strings over the alphabet $\Sigma$

Definition of a Regular Language #

Languages that can be expressed by Regular Expressions.

Given a RE $r$ over an alphabet $\Sigma$, $L(r)$, a language associated with $r$ is defined by:

  • $L(\empty)$: language that contains no strings

  • $L(\epsilon)$: language that only contains ${\epsilon}$

  • For $a \in \Sigma$, $L(a)$: language that only contains ${a}$

Rules of Regular Languages #

$a$ and $b$ are regular expressions

  • $L(a + b) = L(a) \cup L(b)$

  • $L(ab) = L(a)L(b)$

  • $L((a)) = L(a)$

  • $L(a^\star) = (L(a))^\star$

    • $S^\star$ is the smallest superset of $S$ containing $\epsilon$ and is closed under concatenation

Grammars #

  • Informal Grammar: just rules to follow
    • i.e. “Don’t end a sentence with a preposition”
  • Formal Grammar
    • Language: set of strings

    • Grammar: specifies which strings are in the language

Types of Formal Grammar #

Chomsky’s Hierarchy: ordered by ascending expressiveness.

  • Regular Grammars
  • Context-Free Grammars
  • Context-Sensitive Grammars
  • Phase-Structure Grammars

Context-Free Grammars #

More powerful than REs.

Parts of a Context Free Grammar #

  • Terminals: atomic symbols of the language

  • Non-Terminals: “variables” used in the grammar

  • Start Symbol: special non-terminal chosen as the top-level construct of the language

    • The first non-terminal listed is chosen as the start symbol by convention
  • Productions: set of rules that specify legal ways that a non-terminal can be rewritten as a sequence of terminals and non-terminals

Formal Definition of CFGs #

A Context-Free Grammar $G$ is a tuple ${\Sigma, V, S, P}$ where

  • $\Sigma$: set of terminals

  • $V$: set of non-terminals

  • $S$: Start Symbol where $S \in V$

  • $P$: set of productions

    • Rule Form: $X \Longrightarrow w$, where $X \in V, w \in (\Sigma \cup V)^\star$

Note: $\Sigma \cap V = \empty$

  • A Production Rule $X \Longrightarrow w$ can be applied to a string in the form of $sXt$ where $s,t \in (\Sigma \cup V)^\star$ to produce a string $swt$

  • A grammar $G$ generates string $s$ if $\exists$ a finite sequence of applications of productions, such that the first rule is applied to the start symbol, and the last rule produces $s$, $s \in \Sigma^\star$. Written as $G \Rightarrow^\star s$.

Definition of a Context-Free Language #

Languages that can be expressed by Context-Free Grammars.

The language generated by a grammar $G$ is $L(G) = {s \mid G \Rightarrow^\star s}$

Regular Grammars #

More restricted than CFGs.

Productions in a Regular Grammar #

  • Left Recursive: 0 or 1 non-terminal, appears as the left-most symbol on the RHS of the rule

    <S> ::= <T> a b	
    <T> ::= a | <T> b
    
  • Right Recursive: 0 or 1 non-terminal, appears as the right-most symbol on the RHS of rule

    <S> ::= a <T>
    <T> ::= b <T> | a <T>
    

Backus-Naur Form (BNF) #

The notation typically used to describe Context-Free Grammars.

  • Non-Terminals: <NonTerminal>
  • OR: |
  • Productions: LHS ::= RHS or LHS ==> RHS

Extended Backus-Naur Form (EBNF) #

Extensions to BNF to make it more concise, but not more powerful

  • Repetitions: { x } - 0 or more repetitions of x

    • $x^+$: one or more repetitions
    • $x^n$: maximum of $n$ repetitions
  • Optional: [ x ] - 0 or 1 occurrences of x

  • Grouping: ()

Derivations #

The sequence of applications of productions to generate a string.

  • A string is in the language generated by a grammar iff there is a derivation for it

Example, Write 3.14 using the following CFG:

<r> ==> <p> . <p>
<p> ==> <d> | <d> <p>
<d> ==> 0 | ... | 9		# ... just means 1 through 8

Solution:

<r> ==> <p> . <p>
	==> <d> . <d> <p>
	==> 3 . 1 <d>
	==> 3 . 1 4

Parse Trees #

Shows the structure within a string of the language.

  • Parsing is the process of producing a parse tree.

  • A string is in the language generated by a grammar iff there is a parse tree for the string.

Structure of a Parse Tree #

  • Root: start symbol

  • Leaf: terminal

  • Internal Node: non-terminal

    • Children correspond in-order, to the RHS of one of its products in the grammar

Syntactic Ambiguity #

  • A grammar is ambiguous iff it generates a string for where there are two or more Parse Trees

  • A string is ambiguous w/r/t a grammar iff that grammar generates two or more Parse Trees

Note: Two or more derivations does not mean a string is ambiguous.

Dealing with Ambiguity #

  • Include Delimiters

    • i.e. Curly Brackets, fi, etc.
  • Impose Associativity and Precedence

    • i.e. Multiplication gets higher precedence than Addition

    • Introduce a non-terminal for each precedence level

    • For the same precedence level, operators can be

      • Left Associative: recursive term before non-recursive

      • Right Associative: recursive term after non-recursive