The Magic Behind Parser Combinators

24 Mar 2009

If you’re like me, one of the first things that attracted you to Scala was its parser combinators.  Well, maybe that wasn’t the first thing for me, but it was pretty far up there.  Parser combinators make it almost too easy to create a parser for a complex language without ever leaving the comfortable play-pen afforded by Scala.  Incidentally, if you aren’t familiar with the fundamentals of text parsing, context-free grammars and/or parser generators, then you might want to do some reading before you continue with this article.

Intro to Parser Combinators

In most languages, the process of creating a text parser is usually an arduous and clumsy affair involving a parser generator (such as ANTLR, JavaCC, Beaver or <shamelessPlug>ScalaBison</shamelessPlug>) and (usually) a lexer generator such as JFlex.  These tools do a very good job of generating sources for efficient and powerful parsers, but they aren’t exactly the easiest tools to use.  They generally have a very steep learning curve, and due to their unique status as compiler-compilers, an unintuitive architecture.  Additionally, these tools can be somewhat rigid, making it very difficult to implement unique or experimental features.  For this reason alone, many real world compilers and interpreters (such as javac, ruby, jruby and scalac) actually use hand-written parsers.  These are usually easier to tweak, but they can be very difficult to develop and test.  Additionally, hand-written parsers have a tendency toward poor performance (think: the Scala compiler).

When creating a compiler in Scala, it is perfectly acceptable to make use of these conventional Java-generating tools like ANTLR or Beaver, but we do have other options.  Parser combinators are a domain-specific language baked into the standard library.  Using this internal DSL, we can create an instance of a parser for a given grammar using Scala methods and fields.  What’s more, we can do this in a very declarative fashion.  Thanks to the magic of DSLs, our sources will actually look like a plain-Jane context-free grammar for our language.  This means that we get most of the benefits of a hand-written parser without losing the maintainability afforded by parser generators like bison.  For example, here is a very simple grammar for a simplified Scala-like language, expressed in terms of parser combinators:

object SimpleScala extends RegexpParsers {
 
  val ID = """[a-zA-Z]([a-zA-Z0-9]|_[a-zA-Z0-9])*"""r
 
  val NUM = """[1-9][0-9]*"""r
 
  def program = clazz*
 
  def classPrefix = "class" ~ ID ~ "(" ~ formals ~ ")"
 
  def classExt = "extends" ~ ID ~ "(" ~ actuals ~ ")"
 
  def clazz = classPrefix ~ opt(classExt) ~ "{" ~ (member*) ~ "}"
 
  def formals = repsep(ID ~ ":" ~ ID, ",")
 
  def actuals = expr*
 
  def member = (
      "val" ~ ID ~ ":" ~ ID ~ "=" ~ expr
    | "var" ~ ID ~ ":" ~ ID ~ "=" ~ expr
    | "def" ~ ID ~ "(" ~ formals ~ ")" ~ ":" ~ ID ~ "=" ~ expr
    | "def" ~ ID ~ ":" ~ ID ~ "=" ~ expr
    | "type" ~ ID ~ "=" ~ ID
  )
 
  def expr: Parser[Expr] = factor ~ (
      "+" ~ factor
    | "-" ~ factor
  )*
 
  def factor = term ~ ("." ~ ID ~ "(" ~ actuals ~ ")")*
 
  def term = (
      "(" ~ expr ~ ")"
    | ID
    | NUM
  )
}

This is all valid and correct Scala.  The program method returns an instance of Parser[List[Class_]], assuming that Class_ is the AST class representing a syntactic class in the language (and assuming that we had added all of the boiler-plate necessary for AST generation).  This Parser instance can then be used to process a java.io.Reader, producing some result if the input is valid, otherwise producing an error.

How the Magic Works

The really significant thing to notice here is that program is nothing special; just another method which returns an instance of class Parser.  In fact, all of these methods return instances of Parser.  Once you realize this, the magic behind all of this becomes quite a bit more obvious.  However, to really figure it all out, we’re going to need to take a few steps back.

Conceptually, a Parser represents a very simple idea:

Parsers are invoked upon an input stream.  They will consume a certain number of tokens and then return a result along with the truncated stream.  Alternatively, they will fail, producing an error message.

Every Parser instance complies with this description.  To be more concrete, consider the keyword parser (what I like to call the “literal” parser) which consumes a single well-defined token.  For example (note that the keyword method is implicit in Scala’s implementation of parser combinators, which is why it doesn’t appear in the long example above):

def boring = keyword("bore")

The boring method returns a value of type Parser[String].  That is, a parser which consumes input and somehow produces a String as a result (along with the truncated stream).  This parser will either parse the characters b-o-r-e in that order, or it will fail.  If it succeeds, it will return the string "bore" as a result along with a stream which is shortened by four characters.  If it fails, it will return an error message along the lines of “Expected 'bore', got 'Boer'“, or something to that effect.

By itself, such a parser is really not very useful.  After all, it’s easy enough to perform a little bit of String equality testing when looking for a well-defined token.  The real power of parser combinators is in what happens when you start combining them together (hence the name).  A few literal parsers combined in sequence can give us a phrase in our grammar, and a few of these sequences combined in a disjunction can give us the full power of a non-terminal with multiple productions.  As it turns out, all we need is the literal parser (keyword) combined with these two types of combinators to express any LL(*) grammar.

Before we get into the combinators themselves, let’s build a framework.  We will define Parser[A] as a function from Stream[Character] to Result[A], where Result[A] has two implementations: Success[A] and Failure.  The framework looks like the following:

trait Parser[+A] extends (Stream[Character]=>Result[A])
 
sealed trait Result[+A]
 
case class Success[+A](value: A, rem: Stream[Character]) extends Result[A]
 
case class Failure(msg: String) extends Result[Nothing]

Additionally, we must add a concrete parser, keyword, to handle our literals.  For the sake of syntactic compatibility with Scala’s parser combinators, this parser will be defined within a RegexpParsers trait (despite the fact that we don’t really support regular expressions), which our grammar objects can then extend:

trait RegexpParsers {
  implicit def keyword(str: String) = new Parser[String] {
    def apply(s: Stream[Character]) = {
      val trunc = s take str.length
      lazy val errorMessage = "Expected '%s' got '%s'".format(str, trunc.mkString)
 
      if ((trunc lengthCompare str.length) != 0)
        Failure(errorMessage)
      else {
        val succ = trunc.zipWithIndex forall {
          case (c, i) => c == str(i)
        }
 
        if (succ) Success(str, s drop str.length)
        else Failure(errorMessage)
      }
    }
  }
}

For those of you who are still a little uncomfortable with the more obscure higher-order utility methods in the Scala collections framework: don’t worry about it.  While the above may be a bit obfuscated, there isn’t really a need to understand what’s going on at any sort of precise level.  The important point is that this Parser defines an apply method which compares str to a prefix of equal length taken from s, the input character Stream.  At the end of the day, it returns either Success or Failure.
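
To make this concrete, here is a minimal, purely illustrative driver for the literal parser.  The Boring object and the chars helper are my own additions (not part of the framework) and assume the Parser, Result and RegexpParsers definitions above:

object Boring extends RegexpParsers {
  // helper (for illustration only): turn a String into the Stream[Character]
  // that our parsers consume
  def chars(s: String): Stream[Character] =
    s.toStream map { c => Character.valueOf(c) }

  def boring = keyword("bore")

  def main(args: Array[String]) {
    println(boring(chars("bored")))   // succeeds with "bore"; only the 'd' remains in the stream
    println(boring(chars("Boer")))    // fails: Expected 'bore' got 'Boer'
  }
}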

The Sequential Combinator

The first of the two combinators we need to look at is the sequence combinator.  Conceptually, this combinator takes two parsers and produces a new parser which matches the first and the second in order.  If either one of the parsers produces a Failure, then the entire sequence will fail.  In terms of classical logic, this parser corresponds to the AND operation.  The code for this parser is almost ridiculously simple:

class SequenceParser[+A, +B](l: =>Parser[A], 
                             r: =>Parser[B]) extends Parser[(A, B)] {
  lazy val left = l
  lazy val right = r
 
  def apply(s: Stream[Character]) = left(s) match {
    case Success(a, rem) => right(rem) match {
      case Success(b, rem) => Success((a, b), rem)
      case f: Failure => f
    }
 
    case f: Failure => f
  }
}

This is literally just a parser which applies its left operand and then applies its right to whatever is left.  As long as both parsers succeed, a composite Success will be produced containing a tuple of the left and right parser’s results.  Note that Scala’s parser combinator framework actually yields an instance of the ~ case class from its sequence combinator.  This is particularly convenient as it allows for a very nice syntax in pattern matching for semantic actions (extracting parse results).  However, since we will not be dealing with the action combinators in this article, it seemed simpler to just use a tuple.

One item worthy of note is the fact that both left and right are evaluated lazily.  This means that we don’t actually evaluate our constructor parameters until the parser is applied to a specific stream.  This is very important as it allows us to define parsers with recursive rules.  Recursion is really what separates context-free grammars from regular expressions.  Without this laziness, any recursive rules would lead to an infinite loop in the parser construction.

Once we have the sequence combinator in hand, we can add a bit of syntax sugar to enable its use.  All instances of Parser will define a ~ operator which takes a single operand and produces a SequenceParser which handles the receiver and the parameter in order:

trait Parser[+A] extends (Stream[Character]=>Result[A]) {
  def ~[B](that: =>Parser[B]) = new SequenceParser(this, that)
}

With this modification to Parser, we can now create parsers which match arbitrary sequences of tokens.  For example, our framework so far is more than sufficient to define the classPrefix parser from our earlier snippet (with the exception of the regular expression defined in ID, which we currently have no way of handling):

def classPrefix = "class" ~ ID ~ "(" ~ formals ~ ")"
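
As a quick sanity check of sequencing (the Prefix object, the chars helper and the literal tokens below are mine, purely for illustration), note that each ~ yields a pair, so chained sequences produce left-nested tuples.  Also note that nothing in this toy framework skips whitespace, so the input is written without spaces:

object Prefix extends RegexpParsers {
  def chars(s: String): Stream[Character] =
    s.toStream map { c => Character.valueOf(c) }

  // a Parser[((String, String), String)]
  def classFoo = "class" ~ "Foo" ~ "("

  def main(args: Array[String]) {
    println(classFoo(chars("classFoo(")))   // succeeds, yielding the nested pair (("class", "Foo"), "(")
    println(classFoo(chars("classBar(")))   // fails: Expected 'Foo' got 'Bar'
  }
}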

The Disjunctive Combinator

This is a very academic name for a very simple concept.  Let’s think about the framework so far.  We have both literal parsers and sequential combinations thereof.  Using this framework, we are capable of defining parsers which match arbitrary token strings.  We can even define parsers which match infinite token sequences, simply by involving recursion:

def fearTheOnes: Parser[Any] = "1" ~ fearTheOnes

Of course, this parser is absurd, since it only matches an infinite input consisting of the '1' character, but it does serve to illustrate that we have a reasonably powerful framework even in its current form.  This also provides a nice example of how the lazy evaluation of sequence parsers is an absolutely essential feature.  Without it, the fearTheOnes method would enter an infinite loop and would never return an instance of Parser.

However, for all its glitz, our framework is still somewhat impotent compared to “real” parser generators.  It is almost trivial to derive a grammar which cannot be handled by our parser combinators.  For example:

e ::= '1' | '2'

This grammar simply says “match either the '1' character, or the '2' character”.  Unfortunately, our framework is incapable of defining a parser according to this rule.  We have no facility for saying “either this or that”.  This is where the disjunctive combinator comes into play.

In boolean logic, a disjunction is defined according to the following truth table:

P | Q | P ∨ Q
--+---+------
T | T |   T
T | F |   T
F | T |   T
F | F |   F

In other words, the disjunction is true if one or both of its component predicates are true.  This is exactly the sort of combinator we need to bring our framework to full LL(*) potential.  We need to define a parser combinator which takes two parsers as parameters, trying each of them in order.  If the first parser succeeds, we yield its value; otherwise, we try the second parser and return its Result (whether Success or Failure).  Thus, our disjunctive combinator should yield a parser which succeeds if and only if one of its component parsers succeeds.  This is very easily accomplished:

class DisParser[+A](left: Parser[A], right: Parser[A]) extends Parser[A] {
  def apply(s: Stream[Character]) = left(s) match {
    case res: Success[A] => res
    case _: Failure => right(s)
  }
}

Once again, we can beautify the syntax a little bit by adding an operator to the Parser super-trait:

trait Parser[+A] extends (Stream[Character]=>Result[A]) {
  def ~[B](that: =>Parser[B]) = new SequenceParser(this, that)
 
  def |[B >: A](that: Parser[B]) = new DisParser[B](this, that)
}
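
With | in place, the troublesome e ::= '1' | '2' grammar from above becomes a one-liner.  Here is a small illustrative check (the OneOrTwo object and the chars helper are mine, not part of the framework):

object OneOrTwo extends RegexpParsers {
  def chars(s: String): Stream[Character] =
    s.toStream map { c => Character.valueOf(c) }

  def e = "1" | "2"

  def main(args: Array[String]) {
    println(e(chars("2")))   // succeeds with "2"
    println(e(chars("3")))   // fails; the reported error comes from the second alternative
  }
}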

…is that all?

It may seem like magic, but adding the disjunctive combinator alongside the sequential one really does turn our framework into something special, capable of chewing through any LL(*) grammar.  Just in case you don’t believe me, consider the grammar for the pure-untyped lambda calculus, expressed using our framework (alph definition partially elided for brevity):

object LambdaCalc extends RegexpParsers {
 
  def expr: Parser[Any] = term ~ (expr | "")
 
  def term = (
      "fn" ~ id ~ "=>" ~ expr
    | id
  )
 
  val alph = "a"|"b"|"c"|...|"X"|"Y"|"Z"
  val num = "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
 
  def id = alph ~ idSuffix
 
  def idSuffix = (
      (alph | num) ~ idSuffix
    | ""
  )
}

While this grammar may seem a bit obfuscated, it is only because I had to avoid the use of regular expressions to define the ID rule.  Instead, I used a combination of sequential and disjunctive combinators to produce a Parser which matches the desired pattern.  Note that the “...” is not some special syntax, but rather my laziness and wish to avoid a code snippet 310 characters wide.
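
Here is a quick, illustrative demonstration of the grammar above, assuming the elided alph definition is filled out with the full alphabet (the LambdaDemo object and the chars helper are mine).  Note that the input contains no spaces, since nothing in this toy framework strips whitespace:

object LambdaDemo {
  def chars(s: String): Stream[Character] =
    s.toStream map { c => Character.valueOf(c) }

  def main(args: Array[String]) {
    LambdaCalc.expr(chars("fnx=>x")) match {
      case Success(value, _) => println("parsed: " + value)   // succeeds (the value is a deeply-nested tuple)
      case Failure(msg) => println("error: " + msg)
    }
  }
}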

We can also use our framework to define some other useful combinators such as opt and * (used in the initial example). Specifically:

trait Parser[+A] extends (Stream[Character]=>Result[A]) {
  ...
 
  def * : Parser[List[A]] = (
      this ~ * ^^ { case (a, b) => a :: b }
    | "" ^^^ Nil
  )
}
 
trait RegexpParsers {
  ...
 
  def opt[A](p: Parser[A]) = (
      p ^^ { Some(_) }
    | "" ^^^ None
  )
}

Readers who have managed to stay awake to this point may notice that I’m actually cheating a bit in these definitions.  Specifically, I’m using the ^^ and ^^^ combinators.  These are the semantic action combinators which I promised to avoid discussing.  However, for the sake of completeness, I’ll include the sources and leave you to figure out the rest:

trait Parser[+A] extends (Stream[Character]=>Result[A]) { outer =>
  ...
 
  def ^^[B](f: A=>B) = new Parser[B] {
    def apply(s: Stream[Character]) = outer(s) match {
      case Success(v, rem) => Success(f(v), rem)
      case f: Failure => f
    }
  }
 
  def ^^^[B](v: =>B) = new Parser[B] {
    def apply(s: Stream[Character]) = outer(s) match {
      case Success(_, rem) => Success(v, rem)
      case f: Failure => f
    }
  }
}

In short, these combinators are only interesting to people who want their parsers to give them a value upon completion (usually an AST).  In practice, just about any useful application of parser combinators will require them, but since we’re not planning to use our framework for anything useful, there is really no need.
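
Just to hint at why they matter, here is a tiny illustration of ^^ (and the opt combinator from above) producing real values rather than raw token strings.  The Num class, the Digits object and the chars helper are all mine, not part of the grammar or the framework:

case class Num(value: Int)

object Digits extends RegexpParsers {
  def chars(s: String): Stream[Character] =
    s.toStream map { c => Character.valueOf(c) }

  def digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

  // map the matched String into a (hypothetical) AST node
  def num: Parser[Num] = digit ^^ { s => Num(s.toInt) }

  // treat an absent sign as positive
  def sign: Parser[Int] = opt("-") ^^ {
    case Some(_) => -1
    case None => 1
  }

  def main(args: Array[String]) {
    println(num(chars("7")))     // succeeds with Num(7)
    println(sign(chars("42")))   // succeeds with 1, consuming nothing
  }
}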

One really interesting parser from our first example which is worthy of attention is the member rule.  If you recall, this was defined as follows:

def member = (
    "val" ~ ID ~ ":" ~ ID ~ "=" ~ expr
  | "var" ~ ID ~ ":" ~ ID ~ "=" ~ expr
  | "def" ~ ID ~ "(" ~ formals ~ ")" ~ ":" ~ ID ~ "=" ~ expr
  | "def" ~ ID ~ ":" ~ ID ~ "=" ~ expr
  | "type" ~ ID ~ "=" ~ ID
)

This is interesting for two reasons.  First: we have multiple disjunctions handled in the same rule, showing that disjunctive parsers chain just as nicely as do sequential.  But more importantly, our chain of disjunctions includes two parsers which have the same prefix ("def" ~ ID).  In other words, if we attempt to parse an input of “def a: B = 42“, one of these deeply nested parsers will erroneously match the input for the first two tokens.

This grammatical feature forces us to implement some sort of backtracking within our parser combinators.  Intuitively, the "def" ~ ID parser is going to successfully match “def a“, but the enclosing sequence ("def" ~ ID ~ "(") will fail as soon as the “:” token is reached.  At this point, the parser has to take two steps back in the token stream and try again with another parser, in this case, "def" ~ ID ~ ":" ~ ID ~ "=" ~ expr.  It is this feature which separates LL(*) parsing from LL(1) and LL(0).

The good news is that we already have this feature almost by accident.  Well, obviously not by accident since I put some careful planning into this article, but at no point so far did we actually set out to implement backtracking, and yet it has somehow dropped into our collective lap.  Consider once more the implementation of the disjunctive parser:

class DisParser[+A](left: Parser[A], right: Parser[A]) extends Parser[A] {
  def apply(s: Stream[Character]) = left(s) match {
    case res: Success[A] => res
    case _: Failure => right(s)
  }
}

Notice what happens if left fails: it invokes the right parser passing the same Stream instance (s).  Recall that Stream is immutable, meaning that there is nothing left can do which could possibly change the value of s.  Each parser is merely grabbing characters from the head of the stream and then producing a new Stream which represents the remainder.  Parsers farther up the line (like our disjunctive parser) are still holding a reference to the stream prior to these “removals”.  This means that we don’t need to make any special effort to implement backtracking; it just falls out as a natural consequence of our use of the Stream data structure.  Isn’t that nifty?
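
To see this fall out in practice, here is a small sketch of two alternatives sharing a prefix, in the spirit of the member rule above (the Defs object, the chars helper and the literal tokens are mine, purely for illustration):

object Defs extends RegexpParsers {
  def chars(s: String): Stream[Character] =
    s.toStream map { c => Character.valueOf(c) }

  // two alternatives sharing the "def" ~ "a" prefix
  def member = ("def" ~ "a" ~ "(" ~ ")") | ("def" ~ "a" ~ ":" ~ "B")

  def main(args: Array[String]) {
    // the first alternative consumes "def" and "a", then fails at ':';
    // the disjunction simply re-applies the second alternative to the
    // original, untouched stream -- backtracking for free
    println(member(chars("defa:B")))   // succeeds
  }
}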

Conclusion

Parser combinators are an incredibly clever bit of functional programming.  Every time I think about them, I am once again impressed by the ingenuity of their design and the simple elegance of their operation.  The fact that two combinators and a single parser can encode the vast diversity of LL(*) grammars is simply mind-boggling.  Despite their simplicity, parser combinators are capable of some very powerful parsing in a very clean and intuitive fashion.  That to me is magical.

Comments

  1. Thanks for a great post. This brought memories of my first commercial experience of DSL development: http://en.wikibooks.org/wiki/REBOL_Programming/Language_Features/Parse

    David Vydra Tuesday, March 24, 2009 at 9:10 am
  2. Hey daniel – by coincidence I am just finishing rewriting the scalac parser with combinators. I had to add memoization to the combinator lib but even with no other attempts to improve performance it looks like they might end up fast enough. But, the last mile could end up being a lot longer than it appears.

    Paul Phillips Tuesday, March 24, 2009 at 1:11 pm
  3. @Paul

    Actually, I’ve been doing some research lately involving extensive performance testing of Scala’s parser combinators (2.7.3). The results show that parser combinators are about 5x-10x slower than parsers generated by ScalaBison, a directly-encoded recursive ascent/descent parser generator (generates LALR(1) parsers in Scala). Parser combinators are a whopping >20x slower than Beaver, which is a fast table-driven LALR(1) parser generator for Java. Parser combinators also use significantly more memory.

    The best optimization techniques I have found (and read about) involve eliminating backtracking. Merge common prefixes in productions. This reduces the formal strength of your grammar, but also tends to factor shared code into a single location. I’m sure you know this, but you can also use the ~! combinator to halt backtracking for a specific location in a production. I’ve found this technique is most useful to generate better errors, but it’s supposed to help performance too.

    Memoization is an interesting optimization and I would imagine a very effective one. A *really* nice optimization (that I have yet to implement) would be to make disjunctive parsers predictive in the first token (so long as the parsers in question have definite parse values). The advantage here is it would make most disjunctions constant time as opposed to linear in the number of alternatives.

    Daniel Spiewak Tuesday, March 24, 2009 at 6:07 pm
  4. Yes, !~ is necessary (I also added !~> and ~

    Paul Phillips Wednesday, March 25, 2009 at 6:08 am
  5. Oh good, I didn’t read the markup instructions and my whole comment got eaten. Now I remember why I never comment in blogs.

    Paul Phillips Wednesday, March 25, 2009 at 6:17 am
  6. Yeah, it’s on my TODO list to come up with a better comment system for my blog (preferably one with syntax highlighting), but WP is just so annoying and my time has been so sparing… Could you try reposting? It sounded interesting.

    I’m glad that you added !~>, that is one combinator I’ve wished for in the past, but never wished for hard enough to take the time to write myself. :-)

    Daniel Spiewak Wednesday, March 25, 2009 at 9:11 am
  7. I’m not deeply into Scala, so forgive the ignorance that leads to this question: one of the issues I’ve found with trying to create parser combinators in less expressive languages is the circularity problem in the grammar viewed as a graph.

    ‘expr’ uses ‘term’, ‘term’ uses ‘factor’, but ‘factor’ uses ‘expr’.

    Most imperative languages cannot initialize such expressions correctly: one element in the cycle will be null after class initialization, and the graph won’t have a cycle. The only ways out seem to be some kind of indirection, either encoding one (or more) of the productions as a closure (possibly returning a private field), or, if the language has them, location references.

    How does Scala deal with this circularity issue?

    Barry Kelly Wednesday, March 25, 2009 at 11:46 pm
  8. Whups – I used ‘term’, ‘factor’ and ‘expr’ as they are classically used (mul-ops, atoms / parens, add-ops), so I missed that your cycle is slightly different, expr->factor->term->expr, but the question remains.

    Barry Kelly Wednesday, March 25, 2009 at 11:49 pm
  9. Re the performance issue, I would consider compiling the grammar graph into e.g. .class files that implement another strategy, such as LALR or another. That will limit expressiveness over grammars that rely on combinatorial explosion of matching options, etc., but would help with speed. Ambiguity checking wouldn’t happen until runtime though.

    Barry Kelly Wednesday, March 25, 2009 at 11:51 pm
  10. In the framework implemented above (as in Scala’s standard framework), circularity problems are avoided by the following declaration:

    class SequenceParser[+A, +B](l: =>Parser[A],
                                 r: =>Parser[B]) extends Parser[(A, B)] {
      lazy val left = l
      lazy val right = r
      ...
    }

    Both the left and the right parameters of the sequence parser (and its corresponding combinator) are call-by-name, which means that they are not evaluated until they are absolutely needed. The assignment to lazy val(s) just memoizes that evaluation, turning the parameters into effectively call-by-need (think: Haskell).

    Personally, I prefer a framework which has lazy disjunctions rather than sequences. We can turn this same trick with the disjunctive combinator, gaining the following advantages:

    * Left-recursion is now at least possible to encode without infinite recursion (still doesn’t parse).
    * Some degenerate grammars will immediately lead to infinite recursion in the parser construction, rather than “waiting” until the parser gets stuck.

    The only disadvantage to this approach is it forces the “|” combinator to be implemented using implicit conversions, otherwise the dispatch receiver would be eagerly-evaluated, circumventing the laziness.

    Daniel Spiewak Thursday, March 26, 2009 at 3:20 am
  11. Hi Daniel, thanks for that great post. Parser combinators are indeed an impressive example of functional programming techniques.

    A while ago, I had the opportunity to use in a few professional projects the Clean functional programming language (http://clean.cs.ru.nl/), which is a close cousin to Haskell. I used monadic parser combinators a lot, and found the following paper to be a nice introductory text, as well as an illustration of the use of more advanced techniques focused on efficiency. The paper demonstrates the use of continuations to improve the execution efficiency of parser combinators.

    P. Koopman and M.J. Plasmeijer. Efficient Combinator Parsers, In Proc. of Implementation of Functional Languages (IFL ‘98), London, UK, K. Hammond, A.J.T. Davie and C. Clack (Eds.), Springer Verlag, LNCS 1595, pp. 120-136.

    http://www.st.cs.ru.nl/papers/1999/koop98-EffCombParsIFL98.ps.gz

    As Clean is a lazy functional programming language, which supports tail recursion, I am not sure these techniques apply to Scala, but you may find the paper interesting nevertheless.

    Fabien Todescato Wednesday, April 1, 2009 at 2:43 am
  12. thanks for the great article! just curious, how are whitespaces handled in your parsers? what would be best strategy to handle whitespaces and comments?

    Walter Chang Monday, April 13, 2009 at 2:34 am
  13. @Walter

    Whitespace and comments are actually a trickier issue. My personal preference on handling them is to first feed the input source through a proper scanner, which then produces tokens and removes all unnecessary whitespace (and comments). Parser combinators are certainly compatible with this technique, but for some reason it is rarely employed.

    In order to implement whitespace-stripping in my parsers, I would have to do something similar to what RegexParsers does in the Scala combinator framework, which is basically pre-process the stream before every parser, stripping any leading whitespace. Comments can be handled in the same pass.

    Daniel Spiewak Monday, April 13, 2009 at 7:49 am
  14. Hi Daniel,

    I’ve stumbled across your blog. I just want to say thank you, you have a great writing style and the content is hugely informative. I’m trying to learn Scala better and because the language is so grand in scope and capability it can be hard getting started. I know how much time it takes to write quality material. Anyways, thanks again.

    Nathan Matthews Wednesday, July 29, 2009 at 2:13 pm
  15. unfortunately, most of the regex libraries used are the unintuitive, potentially exponential time (or worse) backtracking Perl libraries. It’s a nice idea in theory. I think the research in Elkhound is the nicest technology — a GLR parser that runs as fast as LALR on most inputs. Unfortunately the implementation is c++ and rather geared towards parsing c++ (which demonstrates that it can tackle challenging grammars, but doesn’t so much make it accessible for small languages).

    gatoatigrado Friday, July 31, 2009 at 1:00 am
  16. @gatoatigrado

    GLR is interesting, but it has some fairly serious flaws. For example, despite Tomita’s claims, the algorithm is not O(n^3) in the worst case. This grammar in particular is O(n^4) for all valid input:

    S ::= S S
        | S S S
        | b
    

    Also, GLR is not fully generalized. There are grammars for which GLR (as with any shift-reduce parsing technique) is undefined:

    A ::= B A a
        | a
    
    B ::= \epsilon
    

    Granted, most GLR parser generators will transform the above grammar into something which is actually usable within the constraints of the algorithm, but it doesn’t change the fact that GLR is not entirely what it claims to be.

    Fortunately, there are algorithms which are fully-general and which attain cubic asymptotic bounds without absurd memory usage. For example, GLL (Generalized LL) is one such technique presented at LDTA 2009. It’s basically just recursive-descent, except that it can handle multiple simultaneous parse trails. And, as with recursive-descent, it is extremely amenable to representation within a functional, composable framework (like parser combinators). In other words, you get all of the benefits of a true, generalized parsing algorithm, cubic performance in the worst case with linear performance on most grammars, all within the easy-to-use framework of parser combinators. Share and enjoy: http://github.com/djspiewak/gll-combinators

    Daniel Spiewak Friday, July 31, 2009 at 10:24 pm
  17. I just stumbled from a google search across your blog. Thanks for the great insight. This helped me a lot to understand the behavior of my experimental parsers written in scala. I also noted the lack of activity in your blog ;-) Any chance, you find the time for more elaborations in the future?

    Axel Sammet Friday, March 26, 2010 at 5:02 am
  18. Wonderful post. Exactly what I was looking for (although, admittedly, it seems like I’m one year late ;-)

    Alexander Lehmann Saturday, March 27, 2010 at 5:11 am
  19. You write: “hand-written parsers have a tendency toward poor performance (think: the Scala compiler)”. Yes, the Scala compiler is slow, but the parsing phase isn’t the bottleneck. It’s later phases such as the type checker.

    Seth Tisue Monday, March 29, 2010 at 12:33 pm
  20. Any chance for a follow up expanding on scala 2.8 angle on parser combinators? (packrat?)

    Søren B. Vrist Monday, July 26, 2010 at 2:28 am
  21. One major problem with recursive descent algorithms, such as parser combinators, is they do not do a LL(k) global search for First_k and Follow_k sets ambiguities at parser generation time. You won’t actually know you have an ambiguity if and until you encounter it in the input at runtime. This is quite critical for developing a language:

    http://members.cox.net/slkpg/documentation.html#SLK_FAQ

    In the development of Copute’s grammar, I found critical ambiguities that would have not been evident if I had gone with parser combinators (aka recursive descent algorithms). The tsuris I encountered in resolving ambiguities was due to incorrect grammar.

    Also they will never be as fast, because for the k lookahead conflicts, they follow unnecessary paths, because the global optimization (look ahead tables) was not done.

    I don’t see what is the benefit? Perhaps it is just that the LL(k) parser generation tools are not written in good functional programming style in modern languages, thus making them difficult to adapt to and bootstrap in your new language.

    Shelby Moore III Saturday, January 15, 2011 at 2:20 am
  22. Also, even though the time spent in semantic analysis of the AST is often much greater than the time spent in the parser stage, the speed of the parser stage is still very important for JIT compilation and interpreters.

    Also I had looked at the recursive descent algorithms option and rejected it objectively.

    The speedup in development time from not finding ambiguities will not make ambiguities disappear, because the grammar would still be ambiguous otherwise (backtracing makes a grammar ambiguous except in rare cases), and ambiguity results in semantic inconsistencies in the language, which get borne out as needless programming bugs that waste the hours of every programmer using the language. I have numerous examples of resolved issues at my Google code tracker for Copute.

    So that speed up in development effort incurs a cost that is going to paid (probably more excruciatingly) down the line.

    Shelby Moore III Saturday, January 15, 2011 at 2:37 am
  23. http://en.wikipedia.org/w/index.php?title=Recursive_descent_parser&oldid=407998337#Shortcomings

    In the general case, recursive descent parsers are not limited to the Context-free grammar and thus do no global search for ambiguities in the LL(k) parsing First_k and Follow_k sets. Thus ambiguities are not known until run-time if and until the input triggers them. Such ambiguities where the recursive descent parser defaults (perhaps unknown to the grammar designer) to one of the possible ambiguous paths, thus results in semantic confusion (aliasing) in the use of the language, and leads to bugs by users of ambiguous programming languages which are not reported at compile-time, and which are introduced not by human error, but by ambiguous grammar. The only solution which eliminates these bugs, is to remove the ambiguities and use a Context-free grammar.

    http://en.wikipedia.org/w/index.php?title=Parser_combinator&oldid=407998210#Shortcomings_and_solutions

    Parser combinators, like all recursive descent parsers, are not limited to the Context-free grammar and thus do no global search for ambiguities in the LL(k) parsing First_k and Follow_k sets. Thus ambiguities are not known until run-time if and until the input triggers them. Such ambiguities where the recursive descent parser defaults (perhaps unknown to the grammar designer) to one of the possible ambiguous paths, thus results in semantic confusion (aliasing) in the use of the language, and leads to bugs by users of ambiguous programming languages which are not reported at compile-time, and which are introduced not by human error, but by ambiguous grammar. The only solution which eliminates these bugs, is to remove the ambiguities and use a Context-free grammar.

    Shelby Moore III Saturday, January 15, 2011 at 3:23 am
