Week 2 #
Programming Languages define syntax formally.
Lexical Rules: specify the form of the ‘building blocks’ of a PL
- i.e. Comment Syntax, Tokens (keywords, literals, operators, etc.) and delimiters, White space
Syntax: specifies how the ‘building blocks’ are put together
Regular Expressions #
Notation #
Kleene Star (
*
): 0 or more repetitionsAlternation (
+
or|
): denotes choiceGrouping: marked by parentheses
()
Empty/Null String: denoted by epsilon: $\epsilon$
Empty Language: language with no strings, denoted by $\empty$
Examples:
(0+1)* 0 or more of ("0" OR "1")
1*(:+;)* 0+ of "1", 0+ of (":" OR ";")
(a+b)*aa(a+b)* 0+ of ("a" OR "b"), "aa", 0+ of ("a" OR "b")
Formal Definition #
Let $\Sigma$ denote a finite alphabet. A given string is a Regular Expression (RE) iff it can be formed from primitive REs $$ [\empty, \epsilon, \textrm{ any } a | a \in \Sigma ] $$
Rules of Applications of Regular Expressions #
$a$ and $b$ are regular expressions
$a + b$: Union
$ab$ : Concatenation
$a^\star$: Kleene Closure
$(a)$: Grouping
$\Sigma^\star$: the set of all finite strings over the alphabet $\Sigma$
Definition of a Regular Language #
Languages that can be expressed by Regular Expressions.
Given a RE $r$ over an alphabet $\Sigma$, $L(r)$, a language associated with $r$ is defined by:
$L(\empty)$: language that contains no strings
$L(\epsilon)$: language that only contains ${\epsilon}$
For $a \in \Sigma$, $L(a)$: language that only contains ${a}$
Rules of Regular Languages #
$a$ and $b$ are regular expressions
$L(a + b) = L(a) \cup L(b)$
$L(ab) = L(a)L(b)$
$L((a)) = L(a)$
$L(a^\star) = (L(a))^\star$
- $S^\star$ is the smallest superset of $S$ containing $\epsilon$ and is closed under concatenation
Grammars #
- Informal Grammar: just rules to follow
- i.e. “Don’t end a sentence with a preposition”
- Formal Grammar
Language: set of strings
Grammar: specifies which strings are in the language
Types of Formal Grammar #
Chomsky’s Hierarchy: ordered by ascending expressiveness.
- Regular Grammars
- Context-Free Grammars
- Context-Sensitive Grammars
- Phase-Structure Grammars
Context-Free Grammars #
More powerful than REs.
Parts of a Context Free Grammar #
Terminals: atomic symbols of the language
Non-Terminals: “variables” used in the grammar
Start Symbol: special non-terminal chosen as the top-level construct of the language
- The first non-terminal listed is chosen as the start symbol by convention
Productions: set of rules that specify legal ways that a non-terminal can be rewritten as a sequence of terminals and non-terminals
Formal Definition of CFGs #
A Context-Free Grammar $G$ is a tuple ${\Sigma, V, S, P}$ where
$\Sigma$: set of terminals
$V$: set of non-terminals
$S$: Start Symbol where $S \in V$
$P$: set of productions
- Rule Form: $X \Longrightarrow w$, where $X \in V, w \in (\Sigma \cup V)^\star$
Note: $\Sigma \cap V = \empty$
A Production Rule $X \Longrightarrow w$ can be applied to a string in the form of $sXt$ where $s,t \in (\Sigma \cup V)^\star$ to produce a string $swt$
A grammar $G$ generates string $s$ if $\exists$ a finite sequence of applications of productions, such that the first rule is applied to the start symbol, and the last rule produces $s$, $s \in \Sigma^\star$. Written as $G \Rightarrow^\star s$.
Definition of a Context-Free Language #
Languages that can be expressed by Context-Free Grammars.
The language generated by a grammar $G$ is $L(G) = {s \mid G \Rightarrow^\star s}$
Regular Grammars #
More restricted than CFGs.
Productions in a Regular Grammar #
Left Recursive: 0 or 1 non-terminal, appears as the left-most symbol on the RHS of the rule
<S> ::= <T> a b <T> ::= a | <T> b
Right Recursive: 0 or 1 non-terminal, appears as the right-most symbol on the RHS of rule
<S> ::= a <T> <T> ::= b <T> | a <T>
Backus-Naur Form (BNF) #
The notation typically used to describe Context-Free Grammars.
- Non-Terminals:
<NonTerminal>
- OR:
|
- Productions:
LHS ::= RHS
orLHS ==> RHS
Extended Backus-Naur Form (EBNF) #
Extensions to BNF to make it more concise, but not more powerful
Repetitions:
{ x }
- 0 or more repetitions ofx
- $x^+$: one or more repetitions
- $x^n$: maximum of $n$ repetitions
Optional:
[ x ]
- 0 or 1 occurrences ofx
Grouping:
()
Derivations #
The sequence of applications of productions to generate a string.
- A string is in the language generated by a grammar iff there is a derivation for it
Example, Write 3.14
using the following CFG:
<r> ==> <p> . <p>
<p> ==> <d> | <d> <p>
<d> ==> 0 | ... | 9 # ... just means 1 through 8
Solution:
<r> ==> <p> . <p>
==> <d> . <d> <p>
==> 3 . 1 <d>
==> 3 . 1 4
Parse Trees #
Shows the structure within a string of the language.
Parsing is the process of producing a parse tree.
A string is in the language generated by a grammar iff there is a parse tree for the string.
Structure of a Parse Tree #
Root: start symbol
Leaf: terminal
Internal Node: non-terminal
- Children correspond in-order, to the RHS of one of its products in the grammar
Syntactic Ambiguity #
A grammar is ambiguous iff it generates a string for where there are two or more Parse Trees
A string is ambiguous w/r/t a grammar iff that grammar generates two or more Parse Trees
Note: Two or more derivations does not mean a string is ambiguous.
Dealing with Ambiguity #
Include Delimiters
- i.e. Curly Brackets,
fi
, etc.
- i.e. Curly Brackets,
Impose Associativity and Precedence
i.e. Multiplication gets higher precedence than Addition
Introduce a non-terminal for each precedence level
For the same precedence level, operators can be
Left Associative: recursive term before non-recursive
Right Associative: recursive term after non-recursive