Jun

2010

In the previous article, we skimmed the surface of automated text parsing and set the stage for our impending exploration of the GLL algorithm itself. However, before we can move ahead and do just that, we should first build up some idea of what the requirements are for truly generalized parsing and what sort of problems we are likely to encounter.

I’m going to assume you already have a working understanding of context-free grammars and how to read them. If you don’t, then I recommend you to the Wikipedia page on CFGs. Specifically, the examples are quite instructive.

### Recursion

S ::= '(' S ')' | '(' ')'

In this grammar, the `S`

non-terminal is recursive because one of its productions refers back to itself. Specifically, the first rule corresponding to the `S`

non-terminal is of the form *α S β*, where

*α*and

*β*stand for some arbitrary rule fragments (in this case,

`'('`

and `')'`

, respectively).When a non-terminal maps to a production which is recursive in its *first* token, we say that rule is *left-recursive*. For example:

E ::= E '+' N | E '-' N | N N ::= '1' | '2' | '3' | ...

In this grammar, the `E`

non-terminal is left-recursive in two of its three productions. Left-recursion is a particularly significant property of a grammar because it means that any left-to-right parse process would need to parse `E`

by *first* parsing `E`

itself, and then parsing `'+'`

and finally `N`

(assuming that the parser is using the first production). As you can imagine, it would be very easy for a naïve parsing algorithm to get into an infinite loop, trying to parse `E`

by first parsing `E`

, which requires parsing `E`

, which requires parsing `E`

, etc.

Mathematically, left-recursive productions are always of the form *α E β* where

*α —> ε*. In plain-English, this means that a production is left recursive if the part of the production preceding the recursive token represents the empty string (

*ε*). This is a very nice way of defining left-recursion, because it allows for a specific type of left-recursion known as

*hidden*left-recursion. For example:

A ::= B A '.' | '.' B ::= ',' |

Notice how the second production for `B`

is empty? This means that `B`

*can* map to *ε*, and thus `A`

exhibits hidden left-recursion. The difference between hidden and direct left-recursion is that hidden left-recursion is obscured by other rules in the grammar. If we didn’t know that `B`

had the potential to produce the empty string, then we would never have realized that `A`

is left-recursive.

LR parsing algorithms (such as tabular LALR or recursive-ascent) can handle direct left-recursion without a problem. However, not even Tomita’s GLR can handle hidden left-recursion (which technically means that the GLR algorithm isn’t fully general). Hidden left-recursion is a perfectly valid property for a context-free grammar to exhibit, and so in order to be fully general, a parsing algorithm must be able to handle it. As it turns out, this is just a little bit troublesome, and many papers on parsing algorithms spend a large majority of their time trying to explain how they handle hidden left-recursion.

It’s worth noting that left-recursion cannot be handled by top-down algorithms (such as tabular LL(k) or recursive-descent) without fairly significant contortions. However, such algorithms have no trouble at all with other forms of recursion (such as our original recursive example with `S`

). Left-recursion arises very naturally in many grammars (particularly involving binary forms such as object-oriented method dispatch or mathematical operators) and is one of the primary reasons why many people prefer algorithms in the LR family over LL algorithms.

### Ambiguity

It is perhaps surprising that context-free grammars are not required to be unambiguous. This means that a grammar is allowed to accept a particular input by using more than one possible sequence of rules. The classic example of this is arithmetic associativity:

E ::= E '+' E | E '-' E | '1' | '2' | '3' | '4' | ...

This is an extremely natural way to encode the grammar for mathematical plus and minus. After all, when we mentally think about the `+`

operator, we imagine the structure as two expressions separated by `+`

, where an expression may be a primitive number, or a complex expression like another addition or a subtraction operation. Unfortunately, this particular encoding has a rather problematic ambiguity. Consider the following expression:

4 + 5 + 2

Clearly this is a valid expression, and a parser for the example grammar will certainly accept it as input. However, if we try to generate a parse tree for this expression, we’re going to run into two possible outcomes:

Literally, the question is whether or not we first expand the left or the right `E`

in the top-level `+`

expression. Expanding the left `E`

will give us the tree to the left, where the first two operands (`4`

and `5`

) are added together with the final result being added to `2`

. Expanding the right `E`

gives us the tree to the right, where we add `5`

and `2`

together, adding *that* to `4`

.

Of course, in the case of addition, associativity doesn’t matter too much, we get the same answer either way. However, if this were division, then associativity could make all the difference in the world. *(4 / 5) / 2 = 0.4*, but *4 / (5 / 2) = 1.6*. The point is that we can follow all of the rules set forth by the grammar and arrive at two very different answers. This is the essence of ambiguity, and it poses endless problems for most parsing algorithms.

If you think about it, in order to correctly handle ambiguity, a parser would need to return not just one parse tree for a particular input, but *all possible* parse trees. The parser’s execution would have to keep track of all of these possibilities at the same time, somehow following each one to its conclusion maintaining its state to the very end. This is not an easy problem, particularly in the face of grammars like the following:

S ::= S S S | S S | 'a'

Clearly, this is a contrived grammar. However, it’s still a valid CFG that a generalized parsing algorithm would need to be able to handle. The problem is that there are an *exponential* number of possible parse trees for any given (valid) input. If the parser were to naïvely follow each and every one of these possibilities one at a time, even on a short string, the parse process would take more time than is left in the age of the universe as we know it. Obviously, that’s not an option.

### Conclusion

As you can see, generalized parsing has some very thorny problems to solve. It’s really no wonder that the algorithms tend to be cryptic and difficult to understand. However, this is not to say that the problems are insurmountable. There are some very elegant and easy-to-understand algorithms for solving these problems, and GLL is one of them.

In the next article, we will start looking at the GLL algorithm itself, along with that chronically under-documented data structure at its core, the graph-structured stack (GSS).

## Comments

Dan,

I’m learning about some newer parsing techniques and I just wanted to ask, is GLL the same or similar to Earley parsing? If it’s different, how so?

Ben

I’ve recently begun to read Aho et al.’s

Compilers, and I’m happy I did so before reading your series. Makes it far easier to follow.Just a technical question from blogger to blogger: what tool did you use to generate those graphs? I intend to write a series of blog posts about the thermal simulation of buildings, which also makes heavy use of graphs and nodes. I’d very much appreciate it if you would share.

David

What do you think about generalized parsing vs PEG parsing with left recursion?

This post helped me to track down a problem in my parser. Turns out I had some rules that had an exponential amount of possible parse trees. I’m using a parser generator that has automatic backtracking and it was causing the backtracking algorithm to appear to hang. I couldn’t figure out what was going on but when I read this post I realized that had to be the problem. Thanks!

Really great articles!! But please continue them soon – you’ve left us on a bit of a cliffhanger here and I’m pretty eager to know about the GLL algorithm…

“It is perhaps surprising that context-free grammars are not required to be unambiguous. This means that a grammar is allowed to accept a particular input by using more than one possible sequence of rules.”

Technically speaking, CFGs are generating grammers, not parsing grammars. From that perspective, it’s quite natural that they’re not required to be unambiguous. You don’t speak of generating grammars as “accepting input”.

I am eagerly awaiting Part 3.

Jules, it is a hack that frankly, doesn’t work right, often generating the wrong parse.

http://tratt.net/laurie/research/publications/papers/tratt__direct_left_recursive_parsing_expression_grammars.pdf

Hi, I would also be happy to read the next article of the series – or to see performance improvements for your GLL combinators. In my research group we considered using them last year (IIRC when working on the OOPSLA paper for TypeChef), but we ran away because of the comment “performance at the moment is non-existant” on the project README. Afterwards you seem to say “but for non-ambiguous grammars it’s not so bad, actually we’re faster”, so I’m really confused.

Thanks for the articles in the series so far; it’s a fascinating topic. I too would be greatly interested in the next one!

## Post a Comment