Recursive Descent Parsers

A recursive descent parser is a simple and effective approach to converting a computer language grammar into a program that understands that language. One starts with a grammar in Backus-Naur Form (BNF) and ends up with a compiler or interpreter. Whilst it is a straightforward process, there are some “gotchas” that most articles fail to mention.

Grammars In Practice

A BNF grammar is often explained with examples like:
<expr> ::= <term>|<expr><addop><term>

Here an expression (<expr>) consists of a term, or of an expression followed by an “add operator” (<addop>, e.g. ‘+’) and then a term. I tend to write the above as two rules:

expr = term
expr = expr "+" term

A program that recognises something in the language specified by the grammar looks for the things on the RHS of a rule and, if the input matches the pattern, announces the LHS. In the real world, that means we look for a “statement” like:

statement = "LET" var "=" expr
statement = "FOR" var "=" expr "TO" expr
statement = ...

When our parser says it found a “statement”, we ask which one and proceed to generate or execute code to carry out the action that the statement is expected to do.

I’m using “var” above to mean variable. For the BASIC language, var can be an identifier (ID) such as X or A$; or it can be an array reference such as A(3) or B(I,J). (If you stick to the constraints on BASIC given in ECMA-55, the “FOR …” statement is a bit simpler as the variable for it is limited to just an ID and arrays aren’t allowed.) The generic term var is:

var = ID
var = ID "(" expr ")"
var = ID "(" expr "," expr ")"

In BASIC, arrays look a lot like function calls because both use “(” and “)”. You can only tell them apart by knowing the names of every function: if the name is one of those, it’s a function; if it isn’t, it’s an array. For a compiler / interpreter, the process is the same. It has to have a list of “key words” (KW). If the parser sees something that might be an ID, it checks it against the list of KW names and, if it isn’t one of those, it’s an ID. Otherwise, it’s a KW.
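In code, that check can be a simple table lookup done once by the lexer, before the parser ever sees the token. A minimal sketch in C (the table contents and the name is_keyword are illustrative, not from any particular BASIC):

```c
#include <string.h>

/* Illustrative keyword table -- a real BASIC lists every reserved
   word and built-in function name here. */
static const char *keywords[] = {
    "LET", "FOR", "TO", "STEP", "PRINT", "SIN", "COS", "INSTR", 0
};

/* Called by the lexer after it has scanned something ID-shaped:
   returns 1 for a key word (KW), 0 for an ordinary ID. */
int is_keyword(const char *name) {
    const char **kw;
    for (kw = keywords; *kw; kw++)
        if (strcmp(name, *kw) == 0)
            return 1;
    return 0;
}
```

Doing the check once, in the lexer, means every later rule sees a ready-made KW or ID token instead of re-testing the name.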

You can do the key word check in the grammar like:

item = fun
item = var
fun  = "SIN" "(" expr ")"
fun  = "COS" "(" expr ")"
fun  = ...
var  = ((as above))

Whilst this seems like a good idea, as every function (fun) is listed in the grammar, the processing time is much worse than having the lexer return a special KW type instead of an ID. The reason for the longer processing time isn’t obvious, so it is an easy pitfall to stumble into. It happens because recursive descent parsers try rules and, if a rule fails, move on to the next one. If “item” is mentioned in multiple rules, the check for “is it a keyword / function name” will get repeated. For example, if you had:

expr = item "+" item
expr = item "-" item
expr = item "*" item
expr = item "/" item


You’d potentially check “is it one of the function names” four times. If expr appears in, let’s say, 15 of the language’s statements, that becomes 60 checks. In practice, there are often more levels than just “statement”, “expr” and “item”, and each has multiple rules, so the repetitions quickly build to far more than 60.

Gotchas

The one that is fairly widely known is “no left recursion”.

Despite that, the very first example above had ‘<expr> ::= <expr><addop><term>’. There are lots of examples like this around and, from a “describing a language” perspective, it’s reasonable and can be the best way to show something. However, if you’re implementing the language using a recursive descent parser, it will be fatal. The code for such a parser looks like:

int get_expr(...) {
    if (!get_expr(...)) return FALSE;
    if (!get_match("+",...)) return FALSE;
    if (!get_term(...)) return FALSE;
    return TRUE;
}


The first thing get_expr() does is call get_expr(). There are no checks beforehand and no criteria; it’s just “always”. Your recursive descent is going to go a very long way down, until you’ve used up all of the computer’s stack. Left recursion is fatal with these parsers.

The solution is to change the grammar without affecting the language it represents. A simple fix is to note that the second part is a term and change the rule to ‘expr = term “+” term’, so that we look for something more specific than a whole expression. That solves the immediate problem but raises the question, “why didn’t they just do that in the first place?” If you look at the original, expr = expr “+” term, you’ll see the idea that an expression is something that ends with “+” and a term. Given that expr was also defined as term in the original, that could be term “+” term, or term “+” term “+” term, etc. So our simple fix is missing something. I tend to use:

expr = term expr2*
expr2 = "+" term


Here the ‘expr2*’ means 0 or more instances of expr2, so a plain term still counts as an expr.

In recursive descent parser code, it looks like:

int get_expr(...) {
    if (!get_term(...)) return FALSE;

    while (get_expr2(...)) ;    /* 0 or more "+" term */

    return TRUE;
}


In the real world, we’d also add ‘expr2 = “-” term’ so that it recognises 2+3-4+5+6, etc.

More Specific First

This one has no mathematical basis but it is quite practical. When we write programs to recognise a grammar, we often implement the first rule first, and compiler generators tend to work the same way. The point is that if a more general rule gets matched first, the more specific ones won’t even get tried. A practical example:

var = ID
var = ID "(" expr ")"


This is the array example again. If the rules get coded in this order, when the parser sees an ID it will decide it has found a “var”. It won’t even consider the second rule. Listing the array rules first avoids that.
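A hand-coded sketch of the right ordering, again using illustrative single-character tokens (expr is reduced to one digit for brevity): the array rule is tried first, and the parser backtracks to a saved position before falling back to the plain-ID rule.

```c
#include <ctype.h>

static const char *vp;   /* cursor into the input */

static int get_id(void)    { if (!isalpha((unsigned char)*vp)) return 0; vp++; return 1; }
static int get_digit(void) { if (!isdigit((unsigned char)*vp)) return 0; vp++; return 1; }
static int get_lit(char c) { if (*vp != c) return 0; vp++; return 1; }

/* var = ID "(" expr ")"   -- more specific, so tried FIRST
   var = ID                -- most general, tried last
   Returns 1 only if the whole input is matched.            */
int get_var(const char *text) {
    const char *save;
    vp = text;
    save = vp;

    /* rule 1: array reference */
    if (get_id() && get_lit('(') && get_digit() && get_lit(')'))
        return *vp == '\0';
    vp = save;                           /* backtrack */

    /* rule 2: plain ID */
    if (get_id())
        return *vp == '\0';
    return 0;
}
```

With the rules the other way round, get_var("A(3)") would match the bare ID, stop, and leave “(3)” unconsumed.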

Shorter Rules Are Better

This one is a biggie and I haven’t seen it mentioned elsewhere (aside from warnings that “processing time can grow exponentially”, without explaining why or how to avoid it). Recursive descent parsers have a lot of pluses, but there are some minuses too, and one of their key features, the ability to backtrack, is both.

A recursive descent parser backtracks by trying other rules. It gets a certain way through a rule and then finds that the input doesn’t match the rule so it tries the next rule instead. This means it backtracks by however far it got into the current rule.

Normally, we’d consider backtracking to be fairly fast – it’s just a matter of unget-ing a few input tokens.
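With a tokenised input, “unget-ing” can be as cheap as moving a cursor. A minimal sketch (the names mark, restore and try_rule are illustrative, not from the article’s code):

```c
/* Tokens sit in an array filled in by the lexer; backtracking is
   just moving the cursor back to a saved position. */
#define MAX_TOKENS 256

int tokens[MAX_TOKENS];   /* token codes from the lexer */
int pos;                  /* cursor into tokens[]        */

static int  mark(void)       { return pos; }
static void restore(int m)   { pos = m; }
static int  next_token(void) { return tokens[pos++]; }

/* Try a two-token rule; on any mismatch, roll back to the start
   so the caller can try its next rule from the same place. */
int try_rule(int want_a, int want_b) {
    int m = mark();
    if (next_token() != want_a) { restore(m); return 0; }
    if (next_token() != want_b) { restore(m); return 0; }
    return 1;
}
```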

In our case however, there are some ramifications. Consider these grammar lines:

stmnt = "FOR" var "=" expr "TO" expr "STEP" expr
stmnt = "FOR" var "=" expr "TO" expr


If the second rule applies, you don’t know it until you are most of the way through the first rule. Rolling back looks like unget-ing six items. It might be a few more if the “expr”s aren’t simple numbers but the process should still be fast.

What isn’t so obvious though, is the amount of work that went into get-ing those six or so items.

Almost every rule has alternates, eg:

expr   = and_op "OR" and_op
expr   = and_op
and_op = not_op "AND" not_op
and_op = not_op
not_op = "NOT" comp
not_op = comp
comp   = addsub "<" "=" addsub
comp   = addsub ">" "=" addsub
comp   = addsub "<" ">" addsub
comp   = addsub "<" addsub
comp   = addsub ">" addsub
comp   = addsub "=" addsub
comp   = addsub
addsub = muldiv "+" muldiv
addsub = muldiv "-" muldiv
addsub = muldiv
muldiv = unary "*" unary
muldiv = unary "/" unary
muldiv = unary
unary = "-" item
unary = item
item = NUM
item = STR
item = var
item = "(" expr ")"
item = func
var = ID "(" expr "," expr ")"
var = ID "(" expr ")"
var = ID
func = RSVD "(" expr "," expr "," expr ")"
func = RSVD "(" expr "," expr ")"
func = RSVD "(" expr ")"
func = RSVD "(" ")"


By the way, the above shows how precedence is implemented (but it doesn’t address repetitions like 2+3+4).

  • An expr has 2 rules. To get through those, and_op is evaluated 3 times.
  • There are 2 and_op rules. To get through those, not_op is evaluated 3 times.
  • A not_op has 2 rules. It may evaluate comp twice.
  • A comp has 7 rules. First glance suggests you could end up evaluating addsub 13 times.
  • An addsub could, in practice, evaluate muldiv 4 times.
  • A muldiv may evaluate unary 4 times.
  • A unary could evaluate item 2 times.
  • An item involves 5 rules, some of which add in 3 and 4 more rules.

In all, an expr could require 2+3x(2+3x(2+2x(7+13x(3+4x(3+4x(2+2x(5+3+4))))))) (=8396) rules to be evaluated. It is slightly better (about 4000 rules) as addsub doesn’t get evaluated all 13 times. It gets worse when you add rules to allow expressions like 2+3+4. It also gets worse when you consider a function call (func). A function call like INSTR(“haystack”,”needle”) may take 4000 rule combinations to recognise; but it also recurses into expr 4 times more as it almost matches the first func rule. The result isn’t 4000 x 4000 x 4000; but it isn’t simply 4000+4000+4000+4000+4000 either.

In one version of the grammar for Basic1 between version 0.00 and version 0.01, I had Basic1 print each rule it checked. I was trying to understand which one of them – there were only 80 – was taking up so much compile time. To my surprise, ’10 PRINT INSTR(“hay”,”a”)’ produced a 190 MB debug file which contained over 5,000,000 lines. It wasn’t any one rule – it was the combinations.

Given your parser is going to backtrack by trying the next rule, you can make it more efficient by reducing the rule lengths. It will repeat less work that way and re-evaluate fewer rules.
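For instance, the two “FOR” rules from earlier can be factored so the shared prefix is parsed only once. Using ‘?’ to mean 0 or 1 instances, that might look like:

stmnt = "FOR" var "=" expr "TO" expr step?
step  = "STEP" expr

Now failing to find “STEP” only undoes a one-token probe into step, instead of unget-ing the whole statement and re-parsing it under the second rule.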

If your parser generator supports it, or if you are hand-coding the rules, use grammar operators like * (zero or more of), + (1 or more of) and ? (0 or 1 of). These keep you in the current rule and preserve the work done before that point. An example is:

addsub     = muldiv addsub2*
addsub2    = "+" muldiv
addsub2    = "-" muldiv
muldiv     = power muldiv2*
muldiv2    = "*" power
muldiv2    = "/" power

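Coded by hand, those looping rules stay inside one function and never re-parse their left-hand side. A self-contained sketch in C that both recognises and evaluates (single-digit numbers only, with the unary/power levels collapsed to a bare number for brevity; this is not the actual Basic1 code):

```c
#include <ctype.h>

static const char *ep;   /* cursor into the input */

/* lowest level: a single digit */
static int get_num(int *v) {
    if (!isdigit((unsigned char)*ep)) return 0;
    *v = *ep++ - '0';
    return 1;
}

/* muldiv = num muldiv2*     muldiv2 = "*" num  or  "/" num */
static int get_muldiv(int *v) {
    int rhs;
    if (!get_num(v)) return 0;
    while (*ep == '*' || *ep == '/') {
        char op = *ep++;
        if (!get_num(&rhs)) return 0;
        *v = (op == '*') ? *v * rhs : *v / rhs;
    }
    return 1;
}

/* addsub = muldiv addsub2*  addsub2 = "+" muldiv or "-" muldiv
   Returns 1 only if the whole input is matched.                */
int get_addsub(const char *text, int *v) {
    int rhs;
    ep = text;
    if (!get_muldiv(v)) return 0;
    while (*ep == '+' || *ep == '-') {
        char op = *ep++;
        if (!get_muldiv(&rhs)) return 0;
        *v = (op == '+') ? *v + rhs : *v - rhs;
    }
    return *ep == '\0';
}
```

Because each level loops rather than recursing into itself, 2+3+4 is handled without any extra rules, and operator precedence falls out of which function calls which.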

I went from over 5,000,000 rules evaluated for INSTR(…) to 50,000 and then to under 500 by making these sorts of changes to the grammar. At five million rules, the compile time for one line of code is over a second on my laptop. At less than 500 rules, the project is practical again.

I hope this information is helpful.
