
Unveiling the Mysteries of GLL Part 1: Welcome to the Field

31 May 2010

Generalized parsing is probably the most misunderstood topic in the entire field of automated language processing. There is a persistent perception that generalized parsing is slow and impractical. Even worse, most people seem to believe that generalized parsing is complicated and unpredictable (a perception deriving from the extremely obfuscated nature of most generalized parsing algorithms). This is all very unfortunate, because none of it is true anymore.

Now, before I move forward and justify that rather bold statement, I should probably define what I mean when I say “generalized parsing”. Parsing algorithms generally fall into a handful of broad categories. The most common ones are:

  • LL(k) (e.g. ANTLR, JavaCC)
  • Non-Commutative LL(*) (e.g. most hand-written parsers, parser combinators)
  • Memoized Non-Commutative LL(*) (e.g. Packrat parsers, Scala 2.8 parser combinators)
  • LR(k) (e.g. YACC, Bison)
  • Generalized

These are arranged roughly in order from least to most powerful. This means that the supported expressivity of grammars increases as you go down the list. Techniques like LL don’t support left-recursion or ambiguity; LR supports left-recursion, but not ambiguity; and generalized supports anything that’s context-free. Note that this is a very rough arrangement. It’s difficult to formally analyze the non-commutative LL(*) techniques, and so theorists tend to be a little unclear as to exactly how powerful the techniques are with respect to more well defined classes like LL and LR. However, it is generally assumed that non-commutative LL(*) is strictly more powerful than LL(k) but likely less powerful than LR(k) (since left-recursion can be handled with memoization, but some LR local ambiguities do not always resolve correctly).
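
To make the first of those limitations concrete, here is a minimal, self-contained Scala sketch (hypothetical code, not taken from any parsing library) of naive recursive descent on the left-recursive rule E ::= E '+' 'n' | 'n'. The first alternative re-enters E before consuming any input, which is exactly why plain LL techniques reject left-recursion:

    // Naive recursive descent for:  E ::= E '+' 'n' | 'n'
    // Returns the unconsumed remainder of the input on success.
    def expr(in: List[Char]): Option[List[Char]] =
      expr(in) match {                      // left-recursive call before consuming anything
        case Some('+' :: rest) =>
          rest match {
            case 'n' :: tail => Some(tail)  // ... '+' 'n'
            case _           => None
          }
        case _ =>
          in match {
            case 'n' :: tail => Some(tail)  // E ::= 'n'
            case _           => None
          }
      }

    // expr("n+n".toList) never returns: its very first step is expr("n+n".toList) again.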

As intuition would suggest, algorithms are generally more complex (both in terms of comprehension and asymptotic performance) the more powerful you get. LL(k) algorithms, both the table-driven and the directly-encoded, are usually quite easy to understand. Parser states correspond directly to grammatical rules, and so it’s usually pretty easy to tease out the structure of the parser. By contrast, LR(k) algorithms (most commonly, tabular LALR and recursive-ascent) are usually very difficult to conceptualize and next to impossible to read when encoded in a programming language. One look at the recursive-ascent example on Wikipedia is sufficient to confirm this property.

Most of the listed non-generalized parsing techniques are O(n) in the length of the input. The one exception here is non-commutative LL(*), which is O(k^n) in the case where the grammar is recursively ambiguous and the input is invalid. Generalized parsing, on the other hand, has a lower bound of Ω(n^2) (a property which falls out of its equivalence with boolean matrix multiplication), and no one has ever found a practical algorithm which does better than O(n^3). Clearly, generalized parsing does impose a performance penalty beyond the more “conventional” techniques.

For these reasons, generalized parsing is usually confined to applications which actually need to be able to handle the full set of context-free grammars — most notably, genome analysis and natural language processing. Even worse, the reticence surrounding generalized parsing has led to its avoidance in several less esoteric situations which would benefit greatly from its power — most notably, the Scala, Haskell and C/C++ compilers should use generalized parsing, but don’t.

This is really a shame, because generalized parsing benefits from two major areas of advancement in the past several decades: CPU clock speed and algorithmic improvements. The bias against generalized parsing dates back to a time when processors were slow enough that the difference between O(n) and O(n^3) was fairly significant, even on shorter input strings. It also predates several newer algorithms which encode the full power of generalized parsing in clean, elegant and understandable ways.

It’s a prejudice, plain and simple, and I plan to do something about it. In this series of articles, I will explain some of the fundamental techniques used to make generalized parsing fast and efficient, particularly as they relate to a newer algorithm known as “generalized LL” (GLL). I’ll give a brief outline of how this algorithm is implemented and how it can be easily used in Scala via the gll-combinators framework. Additionally, I will provide some motivation for why generalized parsing is so important, particularly in light of the modern trend in language design toward more and more powerful syntax. Even if you don’t agree with my conclusions, I hope you will come away from this series with a more complete understanding of the state of generalized parsing and how it can be effectively applied.

Comments

  1. 1) ANTLR is LL(*) and has a memoize option
    2) How good is the error recovery and reporting?

    Christian Reisinger Monday, May 31, 2010 at 4:57 am
  2. Interesting article. I look forward to the rest of the series.

    Henry Ware Monday, May 31, 2010 at 7:03 am
  3. @Christian

    You’re right, ANTLR is LL(*). I was thinking of the older versions, which were tabular LL(k). The new versions generate recursive-descent parsers, which are non-commutative LL(*). ANTLR fakes commutativity though by imposing restrictions on conditionals during pre-analysis.

    Daniel Spiewak Monday, May 31, 2010 at 7:15 am
  4. I welcome the series. I would, however, point out that parsing speed is still highly relevant on modern fast machines, not least because code bases are far larger today, and we ask more of our parsers, such as IDE support, editor insight, refactoring, etc. JITted languages targeting platforms like the JVM and .NET spend less time in code generation and more time in the straight-up I/O transform of structured text into structured binaries.

    I don’t know about the code trees you work with, but where I work a clean build takes between 8 and 20 minutes, depending on how fast the machine is and whether it’s running on an SSD or not. But CPU speed is the dominant factor, and parsing is the largest component of that time.

    Barry Kelly Monday, May 31, 2010 at 6:20 pm
  5. A nice small GLR parser that I’ve used:

    http://dparser.sourceforge.net/

    … worth trying out.

    Ladd Monday, May 31, 2010 at 8:55 pm
  6. @Barry

    If the dominant performance factor in your compiler is your parser, then you’re using the wrong algorithm. Seriously! It doesn’t matter what you’re parsing, there is no reason that the parser should take longer than the rest of the semantic analysis, to say nothing of code generation itself. The only way the parser could be taking that long is if the algorithm is bad (e.g. unmemoized non-commutative LL(*)) or if the implementation has a high constant factor (e.g. a naive encoding of recursive-ascent). Even then, I’m very surprised that the parser could possibly be such a large performance factor.

    The thing is that parsing a context-free language is (in the worst case) O(n^3) with modern algorithms. That worst case assumes tightly-recursive ambiguity, so it’s next to impossible that your input stream and grammar are quite that bad. More likely, your language is probably parsable in O(n), assuming your parser’s algorithm choice isn’t horribly broken. O(n) sounds like a lot on a large input file, but keep in mind that the semantic analysis phase of a compiler requires at least one, generally *multiple* passes over the resulting AST, each of which is at least O(n), and usually something even more. For example, type checkers in basic object-oriented compilers tend to be O(n log n).

    And yes, I have done a lot of practical benchmarking with absolutely enormous data sets (approaching millions of lines). A good parsing algorithm can chew through such data in less than a second, even on comparatively older hardware.

    It’s worth mentioning that the Novell Pulse code base takes me about 10 minutes to compile on a 2.66 GHz Core i7 with an SSD. Most of that time is spent compiling our (moderately-sized) Scala codebase; however, if I were to break out the profiler, I doubt I would see the parser as the main culprit. Yes, Scala’s parser was horrible, inefficient and slow in 2.7 (it’s still horrible in 2.8, but at least it’s fast), but the type checker is really what hogs all of the resources. Scala’s type checking algorithm is ridiculously complicated, requiring multiple backwards and forwards passes, building and traversal of data structures, etc. So regardless of how long the parser takes, its runtime is dwarfed by that of the type checker.

    Daniel Spiewak Monday, May 31, 2010 at 9:13 pm
  7. Hello. The following lines in your entry

    > Non-Commutative LL(*) (e.g. most hand-written parsers, parser combinators)
    > Memoized Non-Commutative LL(*) (e.g. Packrat parsers, Scala 2.8 parser combinators)

    are not true. Scala’s parser combinators and packrat parsers are based on Parsing
    Expression Grammars (PEGs), which are *different* from LL(*), which is used in ANTLR.
    PEGs are strictly more powerful than deterministic LR(K) languages and include
    some non-context free languages. Also, a language that a PEG can express can be
    parsed in O(n) by memoization. I advise reading Bryan Ford’s papers “Parsing Expression Grammars: A Recognition-Based Syntactic Foundation” and “Packrat Parsing: Simple, Powerful, Lazy, Linear Time”.

    Kota Mizushima Friday, October 22, 2010 at 5:56 pm
  8. @Kota

    Notice that I specified *non-commutative* LL(*). This is very different from conventional LL(*), which is what ANTLR uses. Non-commutative LL(*) is defined as single-trail top-down parsing with infinite backtracking where choice is not commutative. PEGs are a literal representation of that idea, and parser combinators are a functional version of PEGs. *Every* parser combinator framework is equivalent in power to PEGs.

    > PEGs are strictly more powerful than deterministic LR(K) languages and include
    some non-context free languages.

    This is a somewhat amusing statement, partially because it’s trivially disproven:

    S ::= A b
    A ::= a | a a

    Consider using this grammar to parse the input “aab“. PEGs (and parser combinators) will fail because they will reduce on the first choice of the A non-terminal *before* they get to the second input a, resulting in the wrong offset in S and a failed parse. I think it’s pretty obvious that LR(1) can handle this case, so I won’t go through the full state machine.
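
    To make the failure mode concrete, here is a rough, self-contained Scala sketch of ordered-choice (PEG-style) recognition for exactly this grammar (hypothetical code, not from any actual PEG library); each function returns the unconsumed remainder on success:

    // A <- a / a a   (ordered choice: the first successful alternative commits)
    def A(in: String): Option[String] =
      if (in.startsWith("a")) Some(in.drop(1))        // always wins on "aa...", so "a a" is never tried
      else if (in.startsWith("aa")) Some(in.drop(2))  // dead code under ordered choice
      else None

    // S <- A b
    def S(in: String): Option[String] =
      A(in).filter(_.startsWith("b")).map(_.drop(1))

    S("aab")   // None: A committed to one 'a', so S sees 'a' where it expects 'b'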

    While it *is* true that PEGs can handle *some* grammars which cannot be handled by LR(k), they are not *strictly* more powerful. PEGs can handle certain kinds of ambiguity (e.g. the dangling else problem), and they are very good at handling context-sensitive grammars (though it’s worth noting that context-sensitive parses cannot be memoized in general, so they still require O(k^n) time). However, the class of languages which are deterministic (and thus, can be handled by LR(k), for some finite k) but which *cannot* be handled by PEGs or Packrat is very, very large indeed.

    The truth is that the theory of PEGs (and by extension, parser combinators) is not well understood. It’s very difficult to reason about them due to the non-commutativity of choice. We don’t even have to bring in context-sensitivity to muddy the waters!

    I have read the papers you suggested, several times in fact. I suggest you re-read them, because they absolutely do not make the claim that PEG is strictly more powerful than LR(k).

    Daniel Spiewak Saturday, October 23, 2010 at 7:02 am
  9. Thanks for the reply. I now understand that PEGs are not to be conflated with LL(*). However, there
    are still some mistakes.

    First, you have a misconception about parser combinators. The term “parser combinator”
    itself doesn’t specify any particular parsing algorithm; a “parser combinator” is just a
    technique for making parsers from smaller parsers. For example, Parsec without the “try”
    operator is just about LL(1), which is weaker than LL(*). Also, parser combinators can
    express ambiguity by returning parsing results as lists.

    Second, grammars are conflated with languages in your comment.

    > This is a somewhat amusing statement, partially because it’s trivially disproven:

    > S ::= A b
    > A ::= a | a a

    > Consider using this grammar to parse the input “aab“. PEGs (and parser combinators) will fail because they will reduce on the first choice of the A non-terminal *before* they get to the second input a, resulting in the wrong offset in S and a failed parse. I think it’s pretty obvious that LR(1) can handle this case, so I won’t go through the full state machine.

    It is not a disproof, because I was talking not about the *grammar* but about the *language*. The language
    expressed by the above BNF can easily be expressed by the following PEG:

    S <- A b
    A

    > While it *is* true that PEGs can handle *some* grammars which cannot be handled by LR(k), they are not *strictly* more powerful. PEGs can handle certain kinds of ambiguity (e.g. the dangling else problem), and they are very good at handling context-sensitive grammars (though it’s worth noting that context-sensitive parses cannot be memoized in general, so they still require O(k^n) time). However, the class of languages which are deterministic (and thus, can be handled by LR(k), for some finite k) but which *cannot* be handled by PEGs or Packrat is very, very large indeed.

    PEGs can express *any* deterministic LR(k) *language* according to Ford’s paper “Parsing Expression Grammars: A Recognition-Based Syntactic Foundation”. The following is the relevant point:

    “A final open problem is the relationship and inter-convertibility of CFGs and PEGs. Birman proved that TS and gTS can simulate any deterministic pushdown automata (DPDA) [5], implying that PEGs can express any deterministic LR-class context-free language.”

    > The truth is that the theory of PEGs (and by extension, parser combinators) is not well understood. It’s very difficult to reason about them due to the non-commutativity of choice. We don’t even have to bring in context-sensitivity to muddy the waters!

    Yes. However, the basic properties of Parsing Expression Languages (PELs), the class of languages expressed by PEGs, are described by Ford.

    Kota Mizushima Saturday, October 23, 2010 at 8:42 am
  10. Sorry. There is an erratum in my previous comment.

    It is not a disproof, because I was talking not about the *grammar* but about the *language*. The language
    expressed by the above BNF can easily be expressed by the following PEG:

    errata:
    > S A

    right:
    > S A <- aa / a

    Kota Mizushima Saturday, October 23, 2010 at 8:45 am
  11. I’m sorry again. I forgot to use pre tags. Here is the right version.

    errata:

    S <- A b
    A

    right:

    S <- A b
    A <- aa / a

    Kota Mizushima Saturday, October 23, 2010 at 8:51 am
  12. @Kota

    Sorry for the long delay on replying back; busy busy… I’ll start with your central point and work outward:

    > PEGs can express *any* deterministic LR(k) *language* according to Ford’s paper “Parsing Expression Grammars: A Recognition-Based Syntactic Foundation”. The following is the relevant point[...]

    This is an interesting result. I dug up the paper and re-read that section. I had missed the footnote about gTS emulating any DPDA, which is clearly an important result. I think it’s fairly clear that any PEG can be reduced to gTS, so that does certainly imply that the class of PELs is at least as wide as LR(k). Interesting…

    In my defense though, I wasn’t talking so much about the theoretical languages which can be parsed but rather the grammars. Examining language classes, while interesting in a theoretical sense, is usually not *as* interesting in the practical sense. For example, any deterministic context-free language (for which there exists a DPDA) has a CFG which is LR(k). It is possible to automatically transform that grammar into a form which is LR(1), which is of course, much easier to parse (hence, techniques like LALR). However, the transformations required here are often monstrous to perform by hand and the resulting grammar can be very difficult to maintain. I’m not sure the tool user is at all interested that the LR(1) technique is just as powerful as LR(k); they’re probably more concerned about the immense contrivances required to make their problem expressible.

    So, you’re right about PEGs being strictly more powerful than LR(k) parsers, but my contention is that it doesn’t matter so much.

    > First, you have a misconception about parser combinators. The term “parser combinator”
    itself doesn’t specify any particular parsing algorithm; a “parser combinator” is just a
    technique for making parsers from smaller parsers. For example, Parsec without the “try”
    operator is just about LL(1), which is weaker than LL(*). Also, parser combinators can
    express ambiguity by returning parsing results as lists.

    Parser combinators almost always imply an algorithm, specifically recursive-descent (i.e. PEG parsing). This is because it is very difficult to create composable grammars using any other algorithm. It’s *possible*, but often quite contrived and very ugly. As far as I know, all mainstream parser combinator tools use recursive-descent or memoized recursive-descent.
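
    To illustrate why, here is a minimal hypothetical sketch in plain Scala (no library): a parser is just a function from input to an optional (result, remainder) pair, and sequencing and ordered choice compose such functions directly, which is recursive-descent by construction:

    type Parser[A] = String => Option[(A, String)]

    def lit(s: String): Parser[String] =
      in => if (in.startsWith(s)) Some((s, in.drop(s.length))) else None

    // ordered (non-commutative) choice: try p first, fall back to q
    def or[A](p: Parser[A], q: Parser[A]): Parser[A] =
      in => p(in).orElse(q(in))

    // sequencing: run p, then run q on whatever input p left over
    def seq[A, B](p: Parser[A], q: Parser[B]): Parser[(A, B)] =
      in => p(in).flatMap { case (a, rest) => q(rest).map { case (b, rest2) => ((a, b), rest2) } }

    // composing combinators literally spells out a recursive-descent parser
    val ab: Parser[(String, String)] = seq(lit("a"), or(lit("bb"), lit("b")))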

    At any rate, I concede the point about language classes (clearly, since there is a proof which contradicts my earlier statement), though I maintain that the contrivances required to actually *express* an LR(k) grammar in terms of a PEG would be ridiculously impractical. :-)

    Daniel Spiewak Saturday, October 30, 2010 at 4:37 am
  13. Hiya. Nice article.

    If I may, I’d like to claim LL(*) as the name of the specific parsing strategy (not analysis algorithm mind you) that ANTLR uses (predicated cyclic lookahead DFA). Fairly well established in the books and website for 7 years now. LL(*) as a generic and very loose term for anything beyond LL(k) is probably not a great idea; it muddies things. Better to be very specific about what’s going on. packrat, GLL, full backtracking, etc… LL(*) in my usage is an optimization of a packrat parser but they are equivalent in power. By adding semantic predicates, LL(*) goes beyond vanilla packrat parsing / PEGs.

    Also, I recommend saying “ordered” not “non-commutative”; non-commutative is much less clear and isn’t used much.

    With memoization backtracking parsers are O(n).

    As some readers have pointed out, it’s important to distinguish between expressivity/power of a technique from how it handles a specific grammar.

    Note that ANTLR orders alternatives as well to resolve ambiguities; you might have a mistake above.

    Finally, the conclusion from a recently submitted paper (from me and Kathleen Fisher):

    “LL(*) parsers are as expressive as PEGs and beyond due to semantic predicates. While GLR accepts left-recursive grammars, it cannot recognize context-sensitive languages as LL(*) can. Unlike PEGs or GLR, LL(*) parsers enable arbitrary action execution and provide good support for debugging and error handling. The LL(*) analysis algorithm constructs cyclic lookahead DFA to handle non-LL(k) constructs and then fails over to backtracking via syntactic predicates when it fails to find a suitable DFA. Experiments reveal that ANTLR generates efficient parsers, eliminating almost all backtracking.”

    Terence Parr Sunday, December 5, 2010 at 1:59 pm
  14. > If I may, I’d like to claim LL(*) as the name of the specific parsing strategy

    I can’t say that I agree. LL(*) to my mind refers more to a class of grammar than a specific algorithm. It is a class very distinct from LL(k), since for any finite k, there will be some grammar which requires a lookahead of k + 1. Thus, the class of grammars which is described by LL(*) is strictly greater than LL(k) and very much an interesting category in and of itself.

    Critically, there are several algorithms which are capable of recognizing LL(*), though clearly a backtracking top-down, depth-first walk is the most direct.

    > Fairly well established in the books and website for 7 years now.

    In my experience, the term as applied to a particular algorithm is *not* established as such in academic work. The exception being work focusing on ANTLR and friends. So, ANTLR is a bit of an ecosystem unto itself. :-) That’s good in many ways, but it does mean that the terminology tends to be a bit different.

    This is of course, just based on the work which I have examined and contributed to in the parsing community. So, maybe I’m mistaken here…

    > By adding semantic predicates, LL(*) goes beyond vanilla packrat parsing / PEGs.

    Ish. I have a fairly fundamental disagreement with the ANTLR philosophy of applying semantic predicates to extend the capabilities of the parsing algorithm. This muddies the formal model to the point of near uselessness, and carries the danger of corrupting the declarative nature of the grammar. Obviously, like any tool, it can be used correctly, but I feel that in the case of parsing, formal rigor is more important than convenient escape hatches.

    Now, to put that opinion in perspective, I’m not even a fan of Bison’s precedence declarations, so maybe I’m a bit more picky than the average user. :-) This is why I try to make generalized parsing easier, more efficient and more directly understood. It’s really the only way to get the required practical power from a parser without corrupting the formalism.

    > Also, I recommend saying “ordered” not “non-commutative”; non-commutative is much less clear and isn’t used much.

    Fair point. I like “non-commutative” since it makes the semantics of the union combinator extremely explicit, but “ordered” is fine too.

    > With memoization backtracking parsers are O(n).

    With memoization, backtracking top-down parsing becomes Packrat parsing (unless you’re doing something creatively out of the box). As I point out in the article, memoized infinite-backtracking parsing is strictly more powerful than standard top-down backtracking parsing (in areas like left-recursion). To be clear, it doesn’t accept a larger class of *languages*, just a larger class of grammars.
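
    As a rough sketch of where the linear bound comes from (hypothetical code, assuming a fixed input string indexed by position), packrat-style memoization keys results on the (rule, position) pair, so each pair is evaluated at most once:

    import scala.collection.mutable

    val input = "aaaab"
    val memo = mutable.Map.empty[(String, Int), Option[Int]]

    def memoized(rule: String, pos: Int)(body: => Option[Int]): Option[Int] =
      memo.getOrElseUpdate((rule, pos), body)

    // A <- "a" A / "a"   (each (rule, position) pair is computed once, then cached)
    def A(pos: Int): Option[Int] = memoized("A", pos) {
      if (pos < input.length && input(pos) == 'a') A(pos + 1).orElse(Some(pos + 1))
      else None
    }

    A(0)   // Some(4): at most one evaluation of A per input position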

    > Note that ANTLR orders alternatives as well to resolve ambiguities; you might have a mistake above.

    I didn’t know that. Actually, I find that quite disturbing. I really dislike the silent “resolution” of ambiguities based on nothing more than ordering. It’s a reasonable behavior, but it surprises the heck out of me every time I use a tool which does it.

    > LL(*) parsers are as expressive as PEGs and beyond due to semantic predicates.

    I’m going to take issue with this statement. Semantic predicates (as I understand them) are an addition to the model. Critically, the resulting automaton is no longer theoretically an extended DFA or even a PDA, but actually a full Turing Machine. This is because semantic predicates allow you to just call out into arbitrary functions which do their own nasty things and mess with the flow of your parser in surprising ways. As such, I don’t think it’s fair to say that semantic predicates lend any more power to a particular parsing algorithm, they’re just a very practical corruption of the automaton by serving as hooks into arbitrary actions.

    > Unlike PEGs or GLR, LL(*) parsers enable arbitrary action execution

    I’m not sure what you mean by this. PEGs allow actions at arbitrary points in the parse. If they don’t, then the implementation is at fault, not the algorithm.

    I do agree that GLR is needlessly restrictive though. Bottom-up parsing is clever, but quite murky in practice. It also falls over in the face of things like hidden left-recursion (e.g. `S -> A S x` where A is nullable). There are far better generalized parsing algorithms though (GLL for one, and even derivative parsing is quite a bit nicer).

    Daniel Spiewak Sunday, December 5, 2010 at 2:25 pm
  15. > I can’t say that I agree. LL(*) to my mind refers more to a class of grammar than a specific algorithm.

    In 28 years of building parsers and 20 years reading the academic literature, I’ve only seen LL(*) once or twice; it was referring to backtracking parsers in the comp.compilers newsgroup. I’ve never seen it other than that except loosely occasionally as you’ve got here.

    > It is a class very distinct from LL(k), since for any finite k, there will be some grammar which requires a lookahead of k + 1. Thus, the class of grammars which is described by LL(*) is strictly greater than LL(k) and very much an interesting category in and of itself.

    Yes, but you are not saying what LL(*) means to you. As I have precisely defined it in the recent paper and used it in books and on the web, LL(*) is essentially an implementation of LL-regular grammar parsing with the extension of predicates for backtracking and context-sensitive parsing.

    > Critically, there are several algorithms which are capable of recognizing LL(*), though clearly a backtracking top-down, depth-first walk is the most direct.

    What are you saying LL(*) means precisely? What exactly is the class?

    > In my experience, the term as applied to a particular algorithm is *not* established as such in academic work.

    I just fixed that (academically). I have never seen it used in academic work. Can you point me at a paper?

    > The exception being work focusing on ANTLR and friends. So, ANTLR is a bit of an ecosystem unto itself. That’s good in many ways, but it does mean that the terminology tends to be a bit different.

    Since no one else is referring to it (other than you I guess ;) ), do you really object to ANTLR laying claim to the term?

    > By adding semantic predicates, LL(*) goes beyond vanilla packrat parsing / PEGs.

    > Ish. I have a fairly fundamental disagreement with the ANTLR philosophy of applying semantic predicates to extend the capabilities of the parsing algorithm.

    Have you tried to parse C without using symbol table information to direct the parse?

    > This muddies the formal model to the point of near uselessness, and carries the danger of corrupting the declarative nature of the grammar. Obviously, like any tool, it can be used correctly, but I feel that in the case of parsing, formal rigor is more important than convenient escape hatches.

    Agreed that it can easily be abused and I try to avoid them. When you need context-sensitive parsing, however, there is no way around it.

    > I didn’t know that. Actually, I find that quite disturbing. I really dislike the silent “resolution” of ambiguities based on nothing more than ordering. It’s a reasonable behavior, but it surprises the heck out of me every time I use a tool which does it.

    It only does this if you tell it that it can backtrack when it fails to find a suitable DFA. PEGs are by definition ordered like this so you must really hate those.

    Please note that PEGs derive their ordering from a welcome but unanticipated side effect of the syntactic predicates (a term that I coined) in an earlier version of ANTLR.

    > I’m going to take issue with this statement. Semantic predicates (as I understand them) are an addition to the model. Critically, the resulting automaton is no longer theoretically an extended DFA or even a PDA, but actually a full Turing Machine.

    Yup. Just because it muddies up the semantics, doesn’t mean that they aren’t necessary in some circumstances.

    > This is because semantic predicates allow you to just call out into arbitrary functions which do their own nasty things and mess with the flow of your parser in surprising ways. As such, I don’t think it’s fair to say that semantic predicates lend any more power to a particular parsing algorithm, they’re just a very practical corruption of the automaton by serving as hooks into arbitrary actions.

    They are hooks really into side effect free expressions that test symbol table information so that we can do context-sensitive parsing. The only other choice is to allow ambiguous parsing and then walk trees to resolve the ambiguity ala GLR. You cannot escape executing a symbol table test to distinguish between two constructs, whether they are token streams or subtrees.

    >> Unlike PEGs or GLR, LL(*) parsers enable arbitrary action execution

    > I’m not sure what you mean by this. PEGs allow actions at arbitrary points in the parse. If they don’t, then the implementation is at fault, not the algorithm.

    Because backtracking parsers are constantly speculating, they either must execute actions that are undoable or they must avoid actions altogether. With a backtracking parser, you can’t even announce a syntax error until you’ve read the entire input. Speculation is the enemy of practical parsing ;) There is no way to execute a print statement while speculating, for example. ANTLR gives you the power of PEGs but with a much more flexible system for executing actions because it doesn’t speculate so much. ANTLR parses without action execution and then rewinds to do it again with feeling once it finds the appropriate alternative.

    Terence Parr Sunday, December 5, 2010 at 3:02 pm
  16. > What are you saying LL(*) means precisely? What exactly is the class?

    I would define it as the class of all grammars which can be recognized by (well, used to define a recognizer by) a speculative top-down algorithm. Ok, so that was basically just defining the class of grammars based on the algorithm which you claim it defines, so maybe not so helpful… :-)

    My brain is too muddied at the moment to come up with a properly formal definition, but I was thinking of starting along the lines of defining kth-order predict sets, showing that for any finite k, there is some (k + 1)th-order predict set which defines a larger class of languages. We then let k go to infinity (or practically, n, where n is the input length). This would define the class of grammars, LL(*).

    The reason I don’t like to say that LL(*) refers to a particular algorithm is because LL(k) is algorithm-agnostic. Granted, it usually refers to predictive top-down parsing, but it’s really a more general class of grammars. Even more importantly though, if we say that LL(*) is fixed as a particular algorithm, then we need to invent new terminology to describe other algorithms which recognize precisely the LL(*) grammars (such as PEG, which as you point out, is somewhat different from what ANTLR is doing).

    > I just fixed that (academically). I have never seen it used in academic work. Can you point me at a paper?

    I was thinking about that, and actually I can’t. I have an excuse, because it’s really hard to point someone at a paper which demonstrates the absence of a consensus on something, but still. :-) Since I can’t produce a witness, I’ll back away from my claim that LL(*) has a standard meaning, though I still maintain that in conversations with other parsing experts, the term has been used with no ambiguity in meaning.

    > Since no one else is referring to it (other than you I guess ;) ), do you really object to ANTLR laying claim to the term?

    Yep. :-) As mentioned above, LL(k) is algorithm-agnostic, it feels like LL(*) should be as well. I think it’s important that we have clean and general handles for precise grammatical classes. This makes it easier to classify algorithms. If LL(*) refers to exactly and only the algorithm used by ANTLR, then it becomes harder to talk about the class of grammars handled by PEGs, Packrat parsers, confined derivative parsers, and the myriad of other unforeseen techniques which may deal in precisely that space.

    > Have you tried to parse C without using symbol table information to direct the parse?

    A very good example. Haskell also shares this problem, as do other, more modern languages like Groovy and Scala (though both of these resolve their context-sensitivity by arbitrary precedence).

    I tend to favor generalized parsing to solve this problem. You look at ambiguities like `a * b` in C/C++ and conclude that we need a context-sensitive parser. I look at such terms and conclude that we need a parser which gives us both possibilities and allows us to disambiguate outside the parse. Without this separation of concerns, it can become very difficult to figure out exactly what your parser is doing, since it’s clearly not confined to syntactic analysis anymore.
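
    As a toy sketch of what “disambiguate outside the parse” means in practice (hypothetical names, nothing to do with a real C front end), the parser returns every reading of `a * b` and a later pass filters against the symbol table:

    sealed trait Node
    case class Mult(lhs: String, rhs: String) extends Node              // expression: a * b
    case class PtrDecl(typeName: String, varName: String) extends Node  // declaration: a *b;

    // a generalized parser hands back *all* parses of the ambiguous phrase
    val allParses: List[Node] = List(Mult("a", "b"), PtrDecl("a", "b"))

    // disambiguation is a separate, post-parse concern
    def disambiguate(parses: List[Node], typedefs: Set[String]): List[Node] =
      parses.filter {
        case PtrDecl(t, _) => typedefs.contains(t)   // keep only if 'a' names a type
        case Mult(l, _)    => !typedefs.contains(l)  // keep only if 'a' names a value
      }

    disambiguate(allParses, Set("a"))   // List(PtrDecl("a", "b"))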

    I will admit though that current tools for generalized parsing are a little difficult to work with. ASF+SDF has done a good job (I think) of pushing the envelope on disambiguation filters and similar, making it easier to declaratively handle things like C or Haskell without imposing context on the parse, but it’s still an open research problem.

    It’s also worth noting that there are some generalized parsing algorithms which allow context-sensitivity, if you absolutely *must* have it (e.g. parsing Python becomes a lot easier with context). My GLL combinators library allows you to define context-sensitive parsers if you’re careful, but clearly the O(n^3) bounds go out the window when you mess around with tricks like that.

    > Agreed that it can easily be abused and I try to avoid them. When you need context-sensitive parsing, however, there is no way around it.

    And so the solution would be to not need context-sensitive parsing, right? ;-)

    > Please note that PEGs derive their ordering from a welcome but unanticipated side effect of the syntactic predicates (a term that I coined) in an earlier version of ANTLR.

    Hmm, my understanding was that PEGs date from the era of LL(k) ANTLR, which didn’t (to my knowledge) have ordering. Then again, I didn’t know that the modern ANTLR had ordering, and my knowledge of the earliest origins of PEG is a little bit fuzzy.

    > Yup. Just because it muddies up the semantics, doesn’t mean that they aren’t necessary in some circumstances.

    Only because the LL(*) algorithm (you’ve even got me saying it now!) cannot handle ambiguities except by local disambiguation (through ordering). Generalized parsers solve this problem by allowing non-local disambiguation. ANTLR solves this problem by bringing more information into the local disambiguation scope (context-sensitivity). They’re two different solutions to the same problem, but it’s a bit unfair to claim that either one of them is unavoidable.

    > They are hooks really into side effect free expressions…

    Ok, so not as horrific as I was fearing. :-) As long as they’re side-effect free, I can at least live with them. Well, I won’t start a crusade, anyway.

    > The only other choice is to allow ambiguous parsing and then walk trees to resolve the ambiguity ala GLR.

    Indeed. This is the choice which I prefer since it keeps the parser spec declarative and formally tractable. Perhaps even more importantly, it cleanly separates syntactic and semantic analysis.

    > Because backtracking parsers are constantly speculating, they either must execute actions that are undoable or they must avoid actions altogether.

    Actually, there’s a third option: lazy evaluation of semantic actions. I don’t think Parsec supports this, but in theory it wouldn’t be too difficult. Basically, the semantic actions get chained along and invoked in a giant mass at the end of the day. This has the pleasant property of also allowing semantic actions at arbitrary points even in a bottom-up parse, which has been a long-standing problem with techniques like LALR. There hasn’t been a lot of research in this direction (it’s on my list), though an unpublished paper by Might and Darais (http://arxiv.org/abs/1010.5023) uses lazy reduction to allow breadth-first top-down parsing (something which is inherently speculative).
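
    A tiny hypothetical Scala sketch of the idea (not how any particular framework actually does it): the semantic action travels with the parse result as an unevaluated thunk, so speculative branches that get discarded never run their actions:

    // a parse result carries a *suspended* semantic action plus the remaining input
    case class Parsed[A](action: () => A, rest: String)

    def digit(in: String): Option[Parsed[Int]] =
      in.headOption.filter(_.isDigit).map { c =>
        Parsed(() => { println(s"reduced digit $c"); c.asDigit }, in.tail)
      }

    // speculation may try several branches, but no action has run yet
    val speculative = digit("7rest").orElse(digit("xrest"))

    // only once a complete parse is committed do we force the surviving actions
    speculative.foreach(p => println(p.action()))   // prints "reduced digit 7", then 7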

    > Speculation is the enemy of practical parsing

    Well, it’s the enemy of parsing with side-effecting semantic actions, anyway. I think you’d be hard-pressed to argue that side-effects are a necessity for practical parsing, particularly when the parser most often produces an AST, something which can be done in a purely-declarative fashion.

    > There is no way to execute a print statement while speculating…

    Do you do that often? ;-) Seriously, there are so many better ways of debugging a parser than issuing print statements inside a semantic action. My personal opinion is that side-effecting semantic actions cause an abstraction leak. Users should not have to care what magic the parser performs to invoke the right semantic actions and produce the appropriate result. They should just trust the algorithm. However, as soon as you start side-effecting in your semantic actions, you’re causing an algorithm detail (semantic invocation order) to leak outside of the parser. This frustrates the algorithm designer (who must be careful to preserve invocation order) and ultimately misleads and confuses the user, who by definition doesn’t want to know about what the parser is doing (if they did want to know, they would be writing it by hand).

    > ANTLR parses without action execution and then rewinds to do it again with feeling once it finds the appropriate alternative.

    Clever! Scary, but clever. It seems weird that ANTLR would backtrack all the way to the beginning after reaching the end, but it doesn’t really affect the asymptotic bounds, and clearly it wouldn’t have a significant constant-time impact either (particularly if that allows you to make some optimizations).

    Ironically, this is precisely the sort of trick which makes me advocate side-effect free semantic actions! You’re doing something extremely unintuitive with the parsing algorithm control flow. The user shouldn’t even know that you’re doing that. However, if ANTLR’s semantic actions allow side-effects (don’t they?), then you probably had to go a bit out of your way to preserve the *illusion* of a conventional parse trail in the face of your shenanigans under the surface.

    Daniel Spiewak Sunday, December 5, 2010 at 3:43 pm
  17. nice having a parsing conversation! Don’t get to do that very often. :)

    Anyway, I have to go jump in the pool to maintain my corporal existence. But, a couple of responses.

    > LL(k) is algorithm-agnostic

    I thought that an LL(k) grammar was specifically a grammar for which there exists an LL(k) parser? And those parsers are very well-defined. Or do you mean state table versus recursive descent and all that kind of implementation detail? I define LL(*) not in terms of implementation but in terms of a DFA; you could even build a recursive descent like thing for the DFA. I don’t mean to restrict LL(*) to any implementation details. It only refers to the basic mechanism of using a potentially cyclic DFA with predicated edges.

    BTW, by GLL, do you mean as defined by Adrian Johnstone and friends?

    > my understanding was that PEGs date from the era of LL(k) ANTLR

    Well, PEGs are fairly recent but Bryan Ford refers to my early work in 1995 with ordered alternatives as a means of implementing selective backtracking for syntactic predicates.

    > Actually, there’s a third option: lazy evaluation of semantic actions.

    That is harder in Java and other popular commercial languages ;) I have seen people queue up the actions, but then you can’t use parameters to functions etc… Since all of those scopes have disappeared after parsing.

    > I think you’d be hard-pressed to argue that side-effects are a necessity for practical parsing, particularly when the parser most often produces an AST, something which can be done in a purely-declarative fashion.

    In the paper I describe how I manually read 89 ANTLR grammars from SourceForge; 75% of them had embedded actions. It’s not at all hard to argue that side effects are necessary for practical parsing. If you won’t take my experienced word for it, consider building a configuration file reader. The easiest thing to do is to insert an action into the grammar that triggers some function in your large application. Building a tree is a completely unnecessary complication. That said, I do try to avoid putting actions inside the parser but you can’t escape executing an action in response to some input whether it’s a token stream or a tree node stream. If you avoid them in the parser, that means you need them in a tree walker. That means I use a tree grammar rather than manually building visitors. I add actions to those grammars. See? Can’t escape ;)

    Referring to your comment about needing print statements, recall that many people are building very simple little DSLs rather than large languages. The easiest way to build a calculator is with print statements in the grammar. In principle, again, I agree with you that keeping things separate is a worthy goal, but not worth it in many everyday cases confronting programmers.

    Regarding its backtracking, ANTLR only does this rewind boogie when it has to backtrack. And, it only backtracks to the start of that decision not the entire input stream.

    > You’re doing something extremely unintuitive with the parsing algorithm control flow.

    There’s no escaping backtracking when you need power in a top-down parser. Backing up and doing it again to execute the actions is not a significant marginal cost.

    In the end, it’s pretty clear with almost 150,000 unique downloads over the past two years according to Google analytics that ANTLR has hit the sweet spot regarding power and the comfort zone of most programmers.

    Ok, gotta run. Thanks for the discussion.

    Terence Parr Sunday, December 5, 2010 at 4:38 pm
  18. > nice having a parsing conversation! Don’t get to do that very often. :)

    Indeed! I’ll just throw in a few more remarks in case you feel like replying some more, but I think we’ve more or less reached an understanding.

    > I thought that an LL(k) grammar was specifically a grammar for which there exists an LL(k) parser? And those parsers are very well-defined.

    I think it depends on what you mean by “well-defined”. When I think of an LL(k) parser, I think of a precomputed state table. However, it’s obviously very easy to implement LL(k) using an auto-generated recursive-descent parser with look ahead. In a sense, it’s like the difference between recursive-ascent and a tabular LALR parser. They both handle the same class of grammar, and fundamentally share the same automata, but the raw mechanics are very different.

    I wonder if it’s fair to define grammar class in terms of automaton properties, since it seems that most algorithms which parse equivalent grammatical classes have identical automaton flows (implying that they implicitly share the same fundamental automata). If this is indeed a reasonable way to define grammar class, then your definition of LL(*) would be an acceptable one. I’m just not convinced yet that all automata which parse the same grammar class must necessarily be equivalent (for any single grammar in that class). :-)

    > BTW, by GLL, do you mean as defined by Adrian Johnstone and friends?

    Yes, that’s the one. I tend to forget that there was another GLL a few decades back with very different properties. I think Frost also used the term once or twice in some of his work, though he never went out of his way to define it.

    > …Since all of those scopes have disappeared after parsing.

    Quite so. To do lazy chaining / queuing of semantic actions, you need to do lambda lifting and basically emulate closures. This isn’t *too* painful in Java (anon inner classes work quite nicely), but I agree that it doesn’t fall out too nicely.

    > In the paper I describe how I manually read 89 ANTLR grammars from SourceForge; 75% of them had embedded actions.

    Embedded actions are not necessarily side-effecting. I’m a huge fan of semantic actions in general, just so long as they don’t a) side-effect, or b) affect the parser automaton (note that this is distinct from side-effecting). Even still, I would imagine that most of the grammars you looked at did in fact use side-effecting actions. I suspect that this is due more to how the average developer is accustomed to thinking. Most people don’t naturally think in terms of pure functions, so they use side-effects because they don’t see another way (or they are just more comfortable with mutation).

    > That means I use a tree grammar rather than manually building visitors. I add actions to those grammars. See? Can’t escape ;)

    I think we may have slightly different ideals regarding the scope of the parser (in the abstract functional sense). I see the parser as a function which takes an input stream and returns either a set of values or a set of errors. Those values are most often trees, but I’m perfectly willing to admit things like your configuration file example. One-pass processing of a config file doesn’t require side-effecting semantic actions though. See: https://github.com/djspiewak/gll-combinators/blob/master/examples/src/config/ConfigParser.scala

    > In the end, it’s pretty clear with almost 150,000 unique downloads over the past two years according to Google analytics that ANTLR has hit the sweet spot regarding power and the comfort zone of most programmers.

    Fair point. Java is also the most popular programming language in history. ;-) It fills a wonderful mainstream niche, but that doesn’t mean it’s not worth replacing or striving to better. ANTLR is a great tool, and has inspired some very interesting (and worthwhile!) research, but that doesn’t mean I’m not going to do everything I can to push for something more powerful.

    Daniel Spiewak Sunday, December 5, 2010 at 5:01 pm
  19. I can see that the discussion here has run down, but the exchange between Terence and Daniel actually reduced my confidence in GLL as a tool, and I think the entire discussion is missing a significant point.

    Daniel is arguing that purity of grammar analysis and modeling is more important than practical utility. To achieve that, he’s willing to push complexity into another phase entirely. One that requires a fair pile of code. Code that (like all code) will carry at least one bug for every seven lines. Terence has been more on the “practical tool” end of the spectrum, though of course he’s well aware of the underlying theory. Speaking as someone who has built or maintained several compilers, a parser generator’s primary purpose is to let someone get a job done in a way that delivers the best result to the ultimate user of the software. When you are stuck dealing with an ad hoc language, protecting the purity of the grammar beyond reason is a waste of time. The typename resolution hack in C is the right thing to do because it produces the most robust solution in the field. That requires something comparable to predicates. What’s unfortunate is that predicates, once admitted into the tool, can be abused. That’s the price of expressive power.

    The point that I think is getting missed in this discussion is design of new languages. One of the best reasons to develop a new language using a parser generator is the ability of the tool to detect things like parse ambiguities. This is also a reason to restrict new language designs to grammar families in which those kinds of diagnostics are decidable. Daniel is correct that PEG grammars obscure conceptual errors by defining conflicts away (though after you check using unordered choice, implementing using ordered choice is just fine). So I find myself torn. From a “get the job done” perspective, the advances toward more powerful parsing techniques are a good thing. From a “new language design” perspective, the ready availability of overpowered techniques encourages the creation of ambiguous languages whose designers don’t even know they are ambiguous. So yes, Daniel, let’s have more power, but let’s have a way to shut that off when we need to.

    Finally, the objection to operator precedence rules in otherwise conventional grammars seems a bit silly. Precedence doesn’t change the expressive power of the grammar at all. There’s a straightforward rewrite, so precedence rules are just syntactic sugar.

    Jonathan Shapiro Friday, May 16, 2014 at 7:30 am
