May
2010
Generalized parsing is probably the most misunderstood topic in the entire field of automated language processing. There is a persistent perception that generalized parsing is slow and impractical. Even worse, most people seem to believe that generalized parsing is complicated and unpredictable (a perception deriving from the extremely obfuscated nature of most generalized parsing algorithms). This is all very unfortunate, because none of it is true anymore.
Now, before I move forward and justify that rather bold statement, I should probably define what I mean when I say “generalized parsing”. Parsing algorithms generally fall into one of any number of categories. The most common ones are:
- LL(k) (e.g. ANTLR, JavaCC)
- Non-Commutative LL(*) (e.g. most hand-written parsers, parser combinators)
- Memoized Non-Commutative LL(*) (e.g. Packrat parsers, Scala 2.8 parser combinators)
- LR(k) (e.g. YACC, Bison)
- Generalized
These are arranged roughly in order from least to most powerful. This means that the supported expressivity of grammars increases as you go down the list. Techniques like LL don’t support left-recursion or ambiguity; LR supports left-recursion, but not ambiguity; and generalized supports anything that’s context-free. Note that this is a very rough arrangement. It’s difficult to formally analyze the non-commutative LL(*) techniques, and so theorists tend to be a little unclear as to exactly how powerful the techniques are with respect to more well defined classes like LL and LR. However, it is generally assumed that non-commutative LL(*) is strictly more powerful than LL(k) but likely less powerful than LR(k) (since left-recursion can be handled with memoization, but some LR local ambiguities do not always resolve correctly).
As intuition would suggest, algorithms are generally more complex (both in terms of comprehension and asymptotic performance) the more powerful you get. LL(k) algorithms, both the table-driven and the directly-encoded, are usually quite easy to understand. Parser states correspond directly to grammatical rules, and so it’s usually pretty easy to tease out the structure of the parser. By contrast, LR(k) algorithms (most commonly, tabular LALR and recursive-ascent) are usually very difficult to conceptualize and next to impossible to read when encoded in a programming language. One look at the recursive-ascent example on Wikipedia is sufficient to confirm this property.
Most listed non-generalized parsing techniques are O(n) in the length of the input. The one exception here is non-commutative LL(*), which is O(k^n) in the case where the grammar is recursively ambiguous and the input is invalid. Generalized parsing on the other hand has an absolute lower-bound of o(n^2) (a property which falls out of the equivalence with matrix multiplication), though no one has ever found an algorithm which can do better than O(n^3). Clearly, generalized parsing does impose a performance penalty beyond more “conventional” techniques.
For these reasons, generalized parsing is usually confined to applications which actually need to be able to handle the full set of context-free grammars — most notably, genome analysis and natural language processing. Even worse, the reticence surrounding generalized parsing has led to its avoidance in several less esoteric situations which would benefit greatly from its power — most notably, the Scala, Haskell and C/C++ compilers should use generalized parsing, but don’t.
This is really a shame, because generalized parsing benefits from two major areas of advancement in the past several decades: CPU clock speed and algorithmic improvements. The bias against generalized parsing dates back to a time when processors were slow enough that the difference between O(n) and O(n^3) was fairly significant, even on shorter input strings. It also predates several newer algorithms which encode the full power of generalized parsing in clean, elegant and understandable ways.
It’s a prejudice, plain and simple, and I plan to do something about it. In this series of articles, I will explain some of the fundamental techniques used to make generalized parsing fast and efficient, particularly as they relate to a newer algorithm known as “generalized LL” (GLL). I’ll give a brief outline of how this algorithm is implemented and how it can be easily used in Scala via the gll-combinators framework. Additionally, I will provide some motivation for why generalized parsing is so important, particularly in light of the modern trend in language design toward more and more powerful syntax. Even if you don’t agree with my conclusions, I hope you will come away from this series with a more complete understanding of the state of generalized parsing and how it can be effectively applied.


Comments
1) ANTLR is LL(*) and has a memoize option
2) How good is the error recovery and reporting?
Interesting article. I look forward to the rest of the series.
@Christian
You’re right, ANTLR is LL(*). I was thinking of the older versions, which were tabular LL(k). The new versions generate recursive-descent parsers, which are non-commutative LL(*). ANTLR fakes commutativity though by imposing restrictions on conditionals during pre-analysis.
I welcome the series. I would, however, point out that parsing speed is still highly relevant on modern fast machines, not least because code bases are far larger today, and we ask more of our parsers, such as IDE support, editor insight, refactoring, etc. JITted languages targeting platforms like the JVM and .NET spend less time in code generation and more time in the straight-up I/O transform of structured text into structured binaries.
I don’t know about the code trees you work with, but where I work a clean build takes between 8 and 20 minutes, depending on how fast the machine is and whether it’s running on an SSD or not. But CPU speed is the dominant factor, and parsing the largest component of time.
A nice small GLR parser that I’ve used:
http://dparser.sourceforge.net/
… worth trying out.
@Barry
If the dominant performance factor in your compiler is your parser, then you’re using the wrong algorithm. Seriously! It doesn’t matter what you’re parsing, there is no reason that the parser should take longer than the rest of the semantic analysis, to say nothing of code generation itself. The only way the parser could be taking that long is if the algorithm is bad (e.g. unmemoized non-commutative LL(*)) or if the implementation has a high constant factor (e.g. a naive encoding of recursive-ascent). Even then, I’m very surprised that the parser could possibly be such a large performance factor.
The thing is that parsing a context-free language is (in the worst case) O(n^3) with modern algorithms. That worst case assumes tightly-recursive ambiguity, so it’s next to impossible that your input stream and grammar are quite that bad. More likely, your language is probably parsable in O(n), assuming your parser’s algorithm choice isn’t horribly broken. O(n) sounds like a lot on a large input file, but keep in mind that the semantic analysis phase of a compiler requires at least one, generally *multiple* passes over the resulting AST, each of which is at least O(n), and usually something even more. For example, type checkers in basic object-oriented compilers tend to be O(n log n).
And yes, I have done a lot of practical benchmarking with absolutely enormous data sets (approaching millions of lines). A good parsing algorithm can chew through such data in less than a second, even on comparatively older hardware.
It’s worth mentioning that the Novell Pulse code base takes me about 10 minutes to compile on a 2.66 Ghz Core i7 with an SSD. Most of that time is spent compiling our (moderately-sized) Scala codebase, however if I were to break out the profiler, I doubt I would see the parser as the main culprit. Yes, Scala’s parser was horrible, inefficient and slow in 2.7 (it’s still horrible in 2.8, but at least it’s fast), but the type checker is really what hogs all of the resources. Scala’s type checking algorithm is ridiculously complicated, requiring multiple backwards and forwards passes, building and traversal of data structures, etc. So regardless of how long the parser takes, its runtime is dwarfed by that of the type checker.
Post a Comment