Skip to content

The Magic Behind Parser Combinators

24
Mar
2009

If you’re like me, one of the first things that attracted you to Scala was its parser combinators.  Well, maybe that wasn’t the first thing for me, but it was pretty far up there.  Parser combinators make it almost too easy to create a parser for a complex language without ever leaving the comfortable play-pen afforded by Scala.  Incidentally, if you aren’t familiar with the fundamentals of text parsing, context-free grammars and/or parser generators, then you might want to do some reading before you continue with this article.

Intro to Parser Combinators

In most languages, the process of creating a text parser is usually an arduous and clumsy affair involving a parser generator (such as ANTLR, JavaCC, Beaver or <shamelessPlug>ScalaBison</shamelessPlug>) and (usually) a lexer generator such as JFlex.  These tools do a very good job of generating sources for efficient and powerful parsers, but they aren’t exactly the easiest tools to use.  They generally have a very steep learning curve, and due to their unique status as compiler-compilers, an unintuitive architecture.  Additionally, these tools can be somewhat rigid, making it very difficult to implement unique or experimental features.  For this reason alone, many real world compilers and interpreters (such as javac, ruby, jruby and scalac) actually use hand-written parsers.  These are usually easier to tweak, but they can be very difficult to develop and test.  Additionally, hand-written parsers have a tendency toward poor performance (think: the Scala compiler).

When creating a compiler in Scala, it is perfectly acceptable to make use of these conventional Java-generating tools like ANTLR or Beaver, but we do have other options.  Parser combinators are a domain-specific language baked into the standard library.  Using this internal DSL, we can create an instance of a parser for a given grammar using Scala methods and fields.  What’s more, we can do this in a very declarative fashion.  Thanks to the magic of DSLs, our sources will actually look like a plain-Jane context-free grammar for our language.  This means that we get most of the benefits of a hand-written parser without losing the maintainability afforded by parser generators like bison.  For example, here is a very simple grammar for a simplified Scala-like language, expressed in terms of parser combinators:

object SimpleScala extends RegexpParsers {
 
  val ID = """[a-zA-Z]([a-zA-Z0-9]|_[a-zA-Z0-9])*"""r
 
  val NUM = """[1-9][0-9]*"""r
 
  def program = clazz*
 
  def classPrefix = "class" ~ ID ~ "(" ~ formals ~ ")"
 
  def classExt = "extends" ~ ID ~ "(" ~ actuals ~ ")"
 
  def clazz = classPrefix ~ opt(classExt) ~ "{" ~ (member*) ~ "}"
 
  def formals = repsep(ID ~ ":" ~ ID, ",")
 
  def actuals = expr*
 
  def member = (
      "val" ~ ID ~ ":" ~ ID ~ "=" ~ expr
    | "var" ~ ID ~ ":" ~ ID ~ "=" ~ expr
    | "def" ~ ID ~ "(" ~ formals ~ ")" ~ ":" ~ ID ~ "=" ~ expr
    | "def" ~ ID ~ ":" ~ ID ~ "=" ~ expr
    | "type" ~ ID ~ "=" ~ ID
  )
 
  def expr: Parser[Expr] = factor ~ (
      "+" ~ factor
    | "-" ~ factor
  )*
 
  def factor = term ~ ("." ~ ID ~ "(" ~ actuals ~ ")")*
 
  def term = (
      "(" ~ expr ~ ")"
    | ID
    | NUM
  )
}

This is all valid and correct Scala.  The program method returns an instance of Parser[List[Class_]], assuming that Class_ is the AST class representing a syntactic class in the language (and assuming that we had added all of the boiler-plate necessary AST generation).  This Parser instance can then be used to process a java.io.Reader, producing some result if the input is valid, otherwise producing an error.

How the Magic Works

The really significant thing to notice here is that program is nothing special; just another method which returns an instance of class Parser.  In fact, all of these methods return instances of Parser.  Once you realize this, the magic behind all of this becomes quite a bit more obvious.  However, to really figure it all out, we’re going to need to take a few steps back.

Conceptually, a Parser represents a very simple idea:

Parsers are invoked upon an input stream.  They will consume a certain number of tokens and then return a result along with the truncated stream.  Alternatively, they will fail, producing an error message.

Every Parser instance complies with this description.  To be more concrete, consider the keyword parser (what I like to call the “literal” parser) which consumes a single well-defined token.  For example (note that the keyword method is implicit in Scala’s implementation of parser combinators, which is why it doesn’t appear in the long example above):

def boring = keyword("bore")

The boring method returns a value of type Parser[String].  That is, a parser which consumes input and somehow produces a String as a result (along with the truncated stream).  This parser will either parse the characters b-o-r-e in that order, or it will fail.  If it succeeds, it will return the string "bore" as a result along with a stream which is shortened by four characters.  If it fails, it will return an error message along the lines of “Expected 'bore', got 'Boer'“, or something to that effect.

By itself, such a parser is really not very useful.  After all, it’s easy enough to perform a little bit of String equality testing when looking for a well-defined token.  The real power of parser combinators is in what happens when you start combining them together (hence the name).  A few literal parsers combined in sequence can give us a phrase in our grammar, and a few of these sequences combined in a disjunction can give us the full power of a non-terminal with multiple productions.  As it turns out, all we need is the literal parser (keyword) combined with these two types of combinators to express any LL(*) grammar.

Before we get into the combinators themselves, let’s build a framework.  We will define Parser[A] as a function from Stream[Character] to Result[A], where Result[A] has two implementations: Success[A] and Failure.  The framework looks like the following:

trait Parser[+A] extends (Stream[Character]=>Result[A])
 
sealed trait Result[+A]
 
case class Success[+A](value: A, rem: Stream[Character]) extends Result[A]
 
case class Failure(msg: String) extends Result[Nothing]

Additionally, we must add a concrete parser, keyword, to handle our literals.  For the sake of syntactic compatibility with Scala’s parser combinators, this parser will be defined within the RegexpParsers singleton object (despite the fact that we don’t really support regular expressions):

object RegexpParsers {
  implicit def keyword(str: String) = new Parser[String] {
    def apply(s: Stream[Character]) = {
      val trunc = s take str.length
      lazy val errorMessage = "Expected '%s' got '%s'".format(str, trunc.mkString)
 
      if (trunc lengthCompare str.length != 0) 
        Failure(errorMessage)
      else {
        val succ = trunc.zipWithIndex forall {
          case (c, i) => c == str(i)
        }
 
        if (succ) Success(str, s drop str.length)
        else Failure(errorMessage)
      }
    }
  }
}

For those of you who are still a little uncomfortable with the more obscure higher-order utility methods in the Scala collections framework: don’t worry about it.  While the above may be a bit obfuscated, there isn’t really a need to understand what’s going on at any sort of precise level.  The important point is that this Parser defines an apply method which compares str to an equally-lengthed prefix from s, the input character Stream.  At the end of the day, it returns either Success or Failure.

The Sequential Combinator

The first of the two combinators we need to look at is the sequence combinator.  Conceptually, this combinator takes two parsers and produces a new parser which matches the first and the second in order.  If either one of the parsers produces a Failure, then the entire sequence will fail.  In terms of classical logic, this parser corresponds to the AND operation.  The code for this parser is almost ridiculously simple:

class SequenceParser[+A, +B](l: =>Parser[A], 
                             r: =>Parser[B]) extends Parser[(A, B)] {
  lazy val left = l
  lazy val right = r
 
  def apply(s: Stream[Character]) = left(s) match {
    case Success(a, rem) => right(rem) match {
      case Success(b, rem) => Success((a, b), rem)
      case f: Failure => f
    }
 
    case f: Failure => f
  }
}

This is literally just a parser which applies its left operand and then applies its right to whatever is left.  As long as both parsers succeed, a composite Success will be produced containing a tuple of the left and right parser’s results.  Note that Scala’s parser combinator framework actually yields an instance of the ~ case class from its sequence combinator.  This is particularly convenient as it allows for a very nice syntax in pattern matching for semantic actions (extracting parse results).  However, since we will not be dealing with the action combinators in this article, it seemed simpler to just use a tuple.

One item worthy of note is the fact that both left and right are evaluated lazily.  This means that we don’t actually evaluate our constructor parameters until the parser is applied to a specific stream.  This is very important as it allows us to define parsers with recursive rules.  Recursion is really what separates context-free grammars from regular expressions.  Without this laziness, any recursive rules would lead to an infinite loop in the parser construction.

Once we have the sequence combinator in hand, we can add a bit of syntax sugar to enable its use.  All instances of Parser will define a ~ operator which takes a single operand and produces a SequenceParser which handles the receiver and the parameter in order:

trait Parser[+A] extends (Stream[Character]=>Result[A]) {
  def ~[B](that: =>Parser[B]) = new SequenceParser(this, that)
}

With this modification to Parser, we can now create parsers which match arbitrary sequences of tokens.  For example, our framework so far is more than sufficient to define the classPrefix parser from our earlier snippet (with the exception of the regular expression defined in ID, which we currently have no way of handling):

def classPrefix = "class" ~ ID ~ "(" ~ formals ~ ")"

The Disjunctive Combinator

This is a very academic name for a very simple concept.  Let’s think about the framework so far.  We have both literal parsers and sequential combinations thereof.  Using this framework, we are capable of defining parsers which match arbitrary token strings.  We can even define parsers which match infinite token sequences, simply by involving recursion:

def fearTheOnes: Parser[Any] = "1" ~ fearTheOnes

Of course, this parser is absurd, since it only matches an infinite input consisting of the '1' character, but it does serve to illustrate that we have a reasonably powerful framework even in its current form.  This also provides a nice example of how the lazy evaluation of sequence parsers is an absolutely essential feature.  Without it, the fearTheOnes method would enter an infinite loop and would never return an instance of Parser.

However, for all its glitz, our framework is still somewhat impotent compared to “real” parser generators.  It is almost trivial to derive a grammar which cannot be handled by our parser combinators.  For example:

e ::= '1' | '2'

This grammar simply says “match either the '1' character, or the '2' character”.  Unfortunately, our framework is incapable of defining a parser according to this rule.  We have no facility for saying “either this or that”.  This is where the disjunctive combinator comes into play.

In boolean logic, a disjunction is defined according to the following truth table:

P Q P V Q
T T T
T F T
F T T
F F F

In other words, the disjunction is true if one or both of its component predicates are true.  This is exactly the sort of combinator we need to bring our framework to full LL(*) potential.  We need to define a parser combinator which takes two parsers as parameters, trying each of them in order.  If the first parser succeeds, we yield its value; otherwise, we try the second parser and return its Result (whether Success or Failure).  Thus, our disjunctive combinator should yield a parser which succeeds if and only if one of its component parsers succeeds.  This is very easily accomplished:

class DisParser[+A](left: Parser[A], right: Parser[A]) extends Parser[A] {
  def apply(s: Stream[Character]) = left(s) match {
    case res: Success => res
    case _: Failure => right(s)
  }
}

Once again, we can beautify the syntax a little bit by adding an operator to the Parser super-trait:

trait Parser[+A] extends (Stream[Character]=>Result[A]) {
  def ~[B](that: =>Parser[B]) = new SequenceParser(this, that)
 
  def |(that: Parser[A]) = new DisParser(this, that)
}

…is that all?

It’s almost as if by magic, but the addition of the disjunctive combinator to the sequential actually turns our framework into something really special, capable of chewing through any LL(*) grammar.  Just in case you don’t believe me, consider the grammar for the pure-untyped lambda calculus, expressed using our framework (alph definition partially elided for brevity):

object LambdaCalc extends RegexpParsers {
 
  def expr: Parser[Any] = term ~ (expr | "")
 
  def term = (
      "fn" ~ id ~ "=>" ~ expr
    | id
  )
 
  val alph = "a"|"b"|"c"|...|"X"|"Y"|"Z"
  val num = "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
 
  def id = alph ~ idSuffix
 
  def idSuffix = (
      (alph | num) ~ idSuffix
    | ""
  )
}

While this grammar may seem a bit obfuscated, it is only because I had to avoid the use of regular expressions to define the ID rule.  Instead, I used a combination of sequential and disjunctive combinators to produce a Parser which matches the desired pattern.  Note that the “...” is not some special syntax, but rather my laziness and wish to avoid a code snippet 310 characters wide.

We can also use our framework to define some other, useful combinators such as opt and * (used in the initial example). Specifically:

trait Parser[+A] extends (Stream[Character]=>Result[A]) {
  ...
 
  def *: Parser[List[A]] = (
      this ~ * ^^ { case (a, b) => a :: b }
    | "" ^^^ Nil
  )
}
 
object RegexpParsers {
  ...
 
  def opt[A](p: Parser[A]) = (
      p ^^ { Some(_) }
    | "" ^^^ None
  )
}

Readers who have managed to stay awake to this point may notice that I’m actually cheating a bit in these definitions.  Specifically, I’m using the ^^ and ^^^ combinators.  These are the semantic action combinators which I promised to avoid discussing.  However, for the sake of completeness, I’ll include the sources and leave you to figure out the rest:

trait Parser[+A] extends (Stream[Character]=>Result[A]) { outer =>
  ...
 
  def ^^[B](f: A=>B) = new Parser[B] {
    def apply(s: Stream[Character]) = outer(s) match {
      case Success(v, rem) => Success(f(v), rem)
      case f: Failure => f
    }
  }
 
  def ^^^[B](v: =>B) = new Parser[B] {
    def apply(s: Stream[Character]) = outer(s) match {
      case Success(_, rem) => Success(v, rem)
      case f: Failure => f
    }
  }
}

In short, these combinators are only interesting to people who want their parsers to give them a value upon completion (usually an AST).  In short, just about any useful application of parser combinators will require these combinators, but since we’re not planning to use our framework for anything useful, there is really no need.

One really interesting parser from our first example which is worthy of attention is the member rule.  If you recall, this was defined as follows:

def member = (
    "val" ~ ID ~ ":" ~ ID ~ "=" ~ expr
  | "var" ~ ID ~ ":" ~ ID ~ "=" ~ expr
  | "def" ~ ID ~ "(" ~ formals ~ ")" ~ ":" ~ ID ~ "=" ~ expr
  | "def" ~ ID ~ ":" ~ ID ~ "=" ~ expr
  | "type" ~ ID ~ "=" ~ ID
)

This is interesting for two reasons.  First: we have multiple disjunctions handled in the same rule, showing that disjunctive parsers chain just as nicely as do sequential.  But more importantly, our chain of disjunctions includes two parsers which have the same prefix ("def" ~ ID).  In other words, if we attempt to parse an input of “def a: B = 42“, one of these deeply nested parsers will erroneously match the input for the first two tokens.

This grammatical feature forces us to implement some sort of backtracking within our parser combinators.  Intuitively, the "def" ~ ID parser is going to successfully match “def a“, but the enclosing sequence ("def" ~ ID ~ "(") will fail as soon as the “:” token is reached.  At this point, the parser has to take two steps back in the token stream and try again with another parser, in this case, "def" ~ ID ~ ":" ~ ID ~ "=" ~ expr.  It is this feature which separates LL(*) parsing from LL(1) and LL(0).

The good news is that we already have this feature almost by accident.  Well, obviously not by accident since I put some careful planning into this article, but at no point so far did we actually set out to implement backtracking, and yet it has somehow dropped into our collective lap.  Consider once more the implementation of the disjunctive parser:

class DisParser[+A](left: Parser[A], right: Parser[A]) extends Parser[A] {
  def apply(s: Stream[Character]) = left(s) match {
    case res: Success => res
    case _: Failure => right(s)
  }
}

Notice what happens if left fails: it invokes the right parser passing the same Stream instance (s).  Recall that Stream is immutable, meaning that there is nothing left can do which could possibly change the value of s.  Each parser is merely grabbing characters from the head of the stream and then producing a new Stream which represents the remainder.  Parsers farther up the line (like our disjunctive parser) are still holding a reference to the stream prior to these “removals”.  This means that we don’t need to make any special effort to implement backtracking, it just falls out as a natural consequence of our use of the Stream data structure.  Isn’t that nifty?

Conclusion

Parser combinators are an incredibly clever bit of functional programming.  Every time I think about them, I am once again impressed by the ingenuity of their design and the simple elegance of their operation.  The fact that two combinators and a single parser can encode the vast diversity of LL(*) grammars is simply mind-boggling.  Despite their simplicity, parser combinators are capable of some very powerful parsing in a very clean and intuitive fashion.  That to me is magical.

Hacking Buildr: Interactive Shell Support

12
Jan
2009

Last week, we looked at the unfortunately-unexplored topic of Scala/Java joint compilation.  Specifically, we saw several different ways in which this functionality may be invoked covering a number of different tools.  Among these tools was Buildr, a fast Ruby-based drop-in replacement for Maven with a penchant for simple configuration.  In the article I mentioned that Buildr doesn’t actually have support for the Scala joint compiler out of the box.  In fact, this feature actually requires the use of a Buildr fork I’ve been using to experiment with different extensions.  Among these extensions is a feature I’ve been wanting from Buildr for a long time: the ability to launch a pre-configured interactive shell.

For those coming from a primarily-Java background, the concept of an interactive shell may seem a bit foreign.  Basically, an interactive shell — or REPL, as it is often called — is a line-by-line language interpreter which allows you to execute snippets of code with immediate result.  This has been a common tool in the hands of dynamic language enthusiasts since the days of LISP, but has only recently found its way into the world of mainstream static languages such as Scala.

interactive-shells.png

One of the most useful applications of these tools is the testing of code (particularly frameworks) before the implementations are fully completed.  For example, when working on my port of Clojure’s PersistentVector, I would often spin up a Scala shell to quickly test one aspect or another of the class.  As a minor productivity plug, JavaRebel is a truly invaluable tool for development of this variety.

The problem with this pattern of work is it requires the interactive shell in question to be pre-configured to include the project’s output directory on the CLASSPATH.  While this isn’t usually so bad, things can get very sticky when you’re working with a project which includes a large number of dependencies.  It isn’t too unreasonable to imagine shell invocations stretching into the tens of lines, just to spin up a “quick and dirty” test tool.

Further complicating affairs is the fact that many projects do away with the notion of fixed dependency paths and simply allow tools like Maven or Buildr to manage the CLASSPATH entirely out of sight.  In order to fire up a Scala shell for a project with any external dependencies, I must first manually read my buildfile, parsing out all of the artifacts in use.  Then I have to grope about in my ~/.m2/repository directory until I find the JARs in question.  Needless to say, the productivity benefits of this technique become extremely suspect after the first or second expedition.

For this reason, I strongly believe that the launch of an interactive shell should be the responsibility of the tool managing the dependencies, rather than that of the developer.  Note that Maven already has some support for shells in conjunction with certain languages (Scala among them), but it is as crude and verbose as the tool itself.  What I really want is to be able to invoke the following command and have the appropriate shell launched with a pre-configured CLASSPATH.  I shouldn’t have to worry about the language of my project, the location of my repository or even if the shell requires extra configuration on my platform.  The idea is that everything should all work auto-magically:

$ buildr shell

This is exactly what the interactive-shell branch of my Buildr fork is designed to accomplish.  Whenever the shell task is invoked, Buildr looked through the current project and attempts to guess the language involved.  This guesswork is required for a number of other features, so Buildr is actually pretty accurate in this area.  If the language in question is Groovy or Scala, then the desired shell is obvious.  Java does not have an integrated shell, which means that the default behavior on a Java project would be to raise an error.

However, the benefits of interactive shells are not limited to just the latest-and-greatest languages.  I often use a Scala shell with Java projects, and for certain things a JRuby shell as well (jirb).  Thus, my interactive shell extension also provides a mechanism to allow users to override the default shell on a per-project basis:

define 'my-project' do
  shell.using :clj
end

With this configuration, regardless of the language used by the compiler for “my-project”, Buildr will launch the Clojure REPL whenever the shell task is invoked.  The currently supported shells and their corresponding Buildr identifiers:

  • Clojure’s REPL — :clj
  • Groovy’s Shell — :groovysh
  • JRuby’s IRB — :jirb
  • Scala’s Shell — :scala

It is also possible to explicitly launch a specific shell.  This is useful for situations where you might want to use the Scala shell for testing some things and the JRuby IRB for quickly prototyping other ideas (I do this a lot).  The command to launch the JIRB shell in the context of my-project would be as follows:

$ buildr my-project:shell:jirb

As a special value-added feature, all of these shells (except for Groovy’s, which is weird) will be automatically configured to use JavaRebel for the project compilation target classes if it can be automatically detected.  This detection is performed by examining REBEL_HOME, JAVA_REBEL, JAVAREBEL and JAVAREBEL_HOME environment variables in order.  If any one of these points to a directory which contains javarebel.jar or points directly to javarebel.jar itself, the configuration is assumed and the respective shell invocation is appropriately modified.

javarebel-integration.png

Best of all, this support is implemented using a highly-extensible framework similar to Buildr’s own Compiler API.  It’s very easy for plugin implementors or even average developers to simply drop-in a new shell provider, perhaps for an internal language or even some unexpected application.  The core functionality of shell detection is integrated into Buildr itself, but this in no way hampers extensibility.  For example, I could easily create a third-party .rake plugin for Buildr which added support for a whole new language (e.g. Haskell).  In this plugin, I could also define a new shell provider which would be the default for projects using that language (e.g. GHCi).

Open Question

The good news is that this feature has been discussed extensively on the buildr-user mailing-list and the prevailing opinion seems to be that it should be folded into the main Buildr distribution.  Exactly what form this will take has yet to be decided.  The bad news is that there is still some dispute about a fundamental aspect of this feature’s operation.

The question revolves around what the exact behavior should be when the shell task is invoked.  Should Buildr detect the project (or sub-project) you are in and automatically configure the shell’s CLASSPATH accordingly?  This would give the interactive shell access to different classes depending on the current working directory.  Alternatively, should there be one all-powerful shell per-buildfile configured at the root level?  This would allow your shell to remain consistent throughout the project, regardless of your current directory.  However, it would also mean that some configuration would be required in order to enable the functionality.  (more details of this debate can be found on the mailing-list).

Additionally, what should the exact syntax be for invoking a specific shell?  Rake 0.8 allows tasks to take parameters enclosed within square brackets.  Thus, the syntax would be something more like the following:

$ buildr collection:shell[jirb]

In some sense, this is more logical since it reflects the fact that a single task, shell, is taking care of the work of invoking stuff.  On the other hand, it’s a little less consistent with the rest of Buildr’s tasks, particularly things like “test:TestClass” and so on.  This too is a matter which has yet to be settled.

All in all, this is a pretty experimental branch which is very open (and desirous) of outside input.  How would you use a feature like this?  Is there anything missing from what I have presented?  What design path should be we take with regards to project-local vs global shell configurations?

If you feel like adding your voice to the chorus, feel free to leave a comment or (better yet) post a reply on the mailing-list thread.  You’re also perfectly free to fork my remote branch at GitHub to better experiment with things yourself.  The root of the whole plate of spaghetti is the lib/buildr/shell.rb file.  Bon appetit!

Joint Compilation of Scala and Java Sources

5
Jan
2009

One of the features that the Groovy people like to flaunt is the joint compilation of .groovy and .java files.  This is a fantastically powerful concept which (among other things) allows for circular dependencies between Java, Groovy and back again.  Thus, you can have a Groovy class which extends a Java class which in turn extends another Groovy class.

All this is old news, but what you may not know is the fact that Scala is capable of the same thing.  The Scala/Java joint compilation mode is new in Scala 2.7.2, but despite the fact that this release has been out for more than two months, there is still a remarkable lack of tutorials and documentation regarding its usage.  Hence, this post…

Concepts

For starters, you need to know a little bit about how joint compilation works, both in Groovy and in Scala.  Our motivating example will be the following stimulating snippet:

// foo.scala
class Foo
 
class Baz extends Bar

…and the Java class:

// Bar.java
public class Bar extends Foo {}

If we try to compile foo.scala before Bar.java, the Scala compiler will issue a type error complaining that class Bar does not exist.  Similarly, if we attempt the to compile Bar.java first, the Java compiler will whine about the lack of a Foo class.  Now, there is actually a way to resolve this particular case (by splitting foo.scala into two separate files), but it’s easy to imagine other examples where the circular dependency is impossible to linearize.  For the sake of example, let’s just assume that this circular dependency is a problem and cannot be handled piece-meal.

In order for this to work, either the Scala compiler will need to know about class Bar before its compilation, or vice versa.  This implies that one of the compilers will need to be able to analyze sources which target the other.  Since Scala is the language in question, it only makes sense that it be the accommodating one (rather than javac).

What scalac has to do is literally parse and analyze all of the Scala sources it is given in addition to any Java sources which may also be supplied.  It doesn’t need to be a full fledged Java compiler, but it does have to know enough about the Java language to be able to produce an annotated structural AST for any Java source file.  Once this AST is available, circular dependencies may be handled in exactly the same way as circular dependencies internal to Scala sources (because all Scala and all Java classes are available simultaneously to the compiler).

Once the analysis phase of scalac has blessed the Scala AST, all of the Java nodes may be discarded.  At this point, circular dependencies have been resolved and all type errors have been handled.  Thus, there is no need to carry around useless class information.  Once scalac is done, both the Foo and the Baz classes will have produced resultant Foo.class and Baz.class output files.

However, we’re still not quite done yet.  Compilation has successfully completed, but if we try to run the application, we will receive a NoClassDefFoundError due to the fact that the Bar class has not actually been compiled.  Remember, scalac only analyzed it for the sake of the type checker, no actual bytecode was produced.  Bar may even suffer from a compile error of some sort, as long as this error is within the method definitions, scalac isn’t going to catch it.

The final step is to invoke javac against the .java source files (the same ones we passed to scalac) adding scalac’s output directory to javac’s classpath.  Thus, javac will be able to find the Foo class that we just compiled so as to successfully (hopefully) compile the Bar class.  If all goes well, the final result will be three separate files: Foo.class, Bar.class and Baz.class.

Usage

Although the concepts are identical, Scala’s joint compilation works slightly differently from Groovy’s from a usage standpoint.  More specifically: scalac does not automatically invoke javac on the specified .java sources.  This means that you can perform “joint compilation” using scalac, but without invoking javac you will only receive the compiled Scala classes, the Java classes will be ignored (except by the type checker).  This design has some nice benefits, but it does mean that we usually need at least one extra command in our compilation process.

All of the following usage examples assume that you have defined the earlier example in the following hierarchy:

  • src
    • main
      • java
        • Bar.java
      • scala
        • foo.scala
  • target
    • classes

Command Line

# include both .scala AND .java files
scalac -d target/classes src/main/scala/*.scala src/main/java/*.java

javac -d target/classes 
      -classpath $SCALA_HOME/lib/scala-library.jar:target/classes 
       src/main/java/*.java

Ant

<target name="build">
    <scalac srcdir="src/main" destdir="target/classes">
        <include name="scala/**/*.scala"/>
        <include name="scala/**/*.java"/>
    </scalac>
 
    <javac srcdir="src/main/java" destdir="${scala.library}:target/classes" 
           classpath="target/classes"/>
</target>

Maven

One thing you gotta love about Maven: it’s fairly low on configuration for certain common tasks.  Given the above directory structure and the most recent version of the maven-scala-plugin, the following command should be sufficient for joint compilation:

mvn compile

Unfortunately, there have been some problems reported with the default configuration and complex inter-dependencies between Scala and Java (and back again).  I’m not a Maven…maven, so I can’t help too much, but as I understand things, this POM fragment seems to work well:

<plugin>
    <groupId>org.scala-tools</groupId>
    <artifactId>maven-scala-plugin</artifactId>
 
    <executions>
        <execution>
            <id>compile</id>
            <goals>
            <goal>compile</goal>
            </goals>
            <phase>compile</phase>
        </execution>
 
        <execution>
            <id>test-compile</id>
            <goals>
            <goal>testCompile</goal>
            </goals>
            <phase>test-compile</phase>
        </execution>
 
        <execution>
            <phase>process-resources</phase>
            <goals>
            <goal>compile</goal>
            </goals>
        </execution>
    </executions>
</plugin>
 
<plugin>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
        <source>1.5</source>
        <target>1.5</target>
    </configuration>
</plugin>

You can find more information on the mailing-list thread.

Buildr

Joint compilation for mixed Scala / Java projects has been a long-standing request of mine in Buildr’s JIRA.  However, because it’s not a high priority issue, the developers were never able to address it themselves.  Of course, that doesn’t stop the rest of us from pitching in!

I had a little free time yesterday afternoon, so I decided to blow it by hacking out a quick implementation of joint Scala compilation in Buildr, based on its pre-existing support for joint compilation in Groovy projects.  All of my work is available in my Buildr fork on GitHub.  This also includes some other unfinished goodies, so if you want only the joint compilation, clone just the scala-joint-compilation branch.

Once you have Buildr’s full sources, cd into the directory and enter the following command:

rake setup install

You may need to gem install a few packages.  Further, the exact steps required may be slightly different on different platforms.  You can find more details on Buildr’s project page.

With this highly-unstable version of Buildr installed on your unsuspecting system, you should now be able to make the following addition to your buildfile (assuming the directory structure given earlier):

require 'buildr/scala'
 
# rest of the file...

Just like Buildr’s joint compilation for Groovy, you must explicitly require the language, otherwise important things will break.  With this slight modification, you should be able to build your project as per normal:

buildr

This support is so bleeding-edge, I don’t even think that it’s safe to call it “pre-alpha”.  If you run into any problems, feel free to shoot me an email or comment on the issue.

Conclusion

Joint compilation of Java and Scala sources is a profound addition to the Scala feature list, making it significantly easier to use Scala alongside Java in pre-existing or future projects.  With this support, it is finally possible to use Scala as a truly drop-in replacement for Java without modifying the existing infrastructure beyond the CLASSPATH.  Hopefully this article has served to bring slightly more exposure to this feature, as well as provide some much-needed documentation on its use.

Introduction to Automated Proof Verification with SASyLF

1
Dec
2008

Doesn’t that title just get the blood pumping?  Proof verification has a reputation for being an inordinately academic subject.  In fact, even within scholarly (otherwise known as “unrealistically intelligent“) circles, the automated verification of proofs is known mainly as a complex, ugly and difficult task often not worth the effort.  This is a shame really, because rigorous proofs are at the very core of both mathematics and computer science.  We are nothing without logic (paraphrased contrapositive from Descartes).  Believe it or not, understanding basic proof techniques will be of tremendous aid to your cognitive process, even when working on slightly less ethereal problems such as how to get the freakin’ login page to work properly.

Well, if you made it all the way to the second paragraph, then you either believe me when I say that this is legitimately useful (and cool!) stuff, or you’re just plain bored.  Either way, read on as we commence our exciting journey into the land of rigorous proofs!

SASyLF Crash Course

If you’re at all familiar with the somewhat-specialized field of proof verification, you probably know that SASyLF (pronounced “sassy elf”) is not the most widely used tool for the job.  In fact, it may very well be the least well-known.  More commonly, proofs that require automatic verification are written in Twelf or Coq.  Both of these are fine tools and capable of a lot more than SASyLF, but they can also be extremely difficult to use.  One of the primary motivations behind SASyLF was to produce a tool which was easier to learn, had a higher level syntax (easier to read) and which gave more helpful error messages than Twelf.  The main idea behind these convolutions was to produce a tool which was more suitable for use in the classroom.

The main design decision which sets SASyLF apart from Twelf is the way in which proofs are expressed.  As I understand it, Twelf exploits Curry-Howard correspondence to represent proofs implicitly in the types of a functional program (update: this is incorrect; see below).  While this can be very powerful, it’s not the most intuitive way to think about a proof.  Eschewing this approach, SASyLF expresses proofs using unification (very similar to Prolog) and defines inference rules explicitly in a natural-language style.

There are three main components to a SASyLF proof:

  • Syntax
  • Judgments
  • Theorems/Lemmas

Intuitively enough, the syntax section is where we express the grammar for the language used throughout our proof.  This grammar is expressed very naturally using BNF, just as if we were defining the language mathematically for a hand-written proof.  Left-recursion is allowed, as is right-recursion, arbitrary symbols, ambiguity and so on.  SASyLF’s parser is mind bogglingly powerful, capable of chewing threw just about any syntax you throw at it.  The main restriction is that you cannot use parentheses, square brackets ([]), pipes (|) or periods (.) in your syntax.  The pure-untyped lambda calculus defined in SASyLF would look something like this:

t ::= fn x => t[x]
    | t t
    | x

I said we couldn’t use brackets, but that’s only because SASyLF assigns some special magic to these operators.  In a nutshell, they allow the above definition of lambda calculus to ignore all of the issues associated with variable name freshness and context.  For simplicity’s sake, that’s about as far as I’m going to go into these mysterious little thingies.

The judgments section is where we define our inference rules.  Just as if we were defining these rules by hand, the syntax has the conditionals above a line of hyphens with the conclusion below.  The label for the rule goes to the right of the “line”.  What could be more natural?

judgment eval: t -> t

t1 -> t1′
--------------- E-Beta1
t1 t2 -> t1′ t2

The judgment syntax is what defines the syntax for the -> “operator”.  Once SASyLF sees this, it knows that we may define rules of the form t -> t, where t is defined by the syntax section.  Further on down, SASyLF sees our E-Beta1 rule.  Each of the tokens within this rule (aside from ->) begins with “t“.  From this, SASyLF is able to infer that we mean “a term as defined previously”.  Thus, this rule is syntactically valid according to our evaluation judgment and the syntax given above.

Of course, theorems are where you will find the real meat of any proof (I’m using the word “proof” very loosely to mean the collection of proven theorems and lemmas which indicates some fact(s) about a language).  SASyLF wouldn’t be a very complete proof verification system without support for some form of proving.  Once again, the syntax is extremely natural language, almost to the point of being overly-verbose.  A simple theorem given the rules above plus a little would be to show that values cannot evaluate:

theorem eval-value-implies-contradiction:
    forall e: t -> t’
    forall v: t value
    exists contradiction .

    _: contradiction by unproved
end theorem

Note that contradiction is not more SASyLF magic.  We can actually define what it means to have a contradiction by adding the following lines to our judgment section:

judgment absurd: contradiction

In other words, we can have a contradiction, but there are no rules which allow us to get it.  In fact, the only way to have a contradiction is to somehow get SASyLF to the point where it sees that there are no cases which satisfy some set of proven facts (given the forall assumptions).  If SASyLF cannot find any cases to satisfy some rules, it allows us to derive anything at all, including judgments which have no corresponding rules.

Readers who have yet to fall asleep will notice that I cleverly elided a portion of the “theorem” code snippet.  That’s because there wasn’t really a way to prove that contradiction given the drastically abbreviated rules given in earlier samples.  Instead of proving anything, I used a special SASyLF justification, unproved, which allows the derivation of any fact given no input (very useful for testing incomplete proofs).  Lambda calculus isn’t much more complicated than what I showed, but it does require more than just an application context rule in its evaluation semantics.  In order to get a taste for SASyLF’s proof syntax, we’re going to need to look at a much simpler language.

Case Study: Integer Comparison

For this case study, we’re going to be working with simple counting numbers which start with 0 and then proceed upwards, each value expressed as the successor of its previous value.  Thus, the logical number 3 would be s s s 0.  Not a very useful language in the real world, but much easier to deal with in the field of proof verification.  The syntax for our natural numbers looks like this:

n ::= 0
    | s n

With this humble definition for n, we can go on to define the mathematical greater-than comparison using two rules under a single judgment:

judgment gt: n > n

------- gt-one
s n > n

n1 > n2
--------- gt-more
s n1 > n2

Believe it or not, this is all we need to do in terms of definition.  The first rule says that the successor of any number is greater than that same number (3 > 2).  The second rule states that if we already have two numbers, one greater than the other (12 > 4), then the successor of the greater number will still be greater than the lesser (13 > 4).  All very intuitive, but the real question is whether or not we can prove anything with these definitions.

An Easy Lemma

For openers, we can try something reasonably simple: prove that all non-zero numbers are greater than zero.  This is such a simple proof that we won’t even bother calling it a theorem, we will give it the lesser rank of “lemma”:

lemma all-gt-zero:
    forall n
    exists s n > 0 .

    _: s n > 0 by induction on n:
        case 0 is
            _: s 0 > 0 by rule gt-one
        end case

        case s n1 is
            g: s n1 > 0 by induction hypothesis on n1
            _: s s n1 > 0 by rule gt-more on g
        end case
    end induction
end lemma

In order to prove anything about n, we first need to “pull it apart” and find out what it’s made of.  To do that, we’re going to use induction.  We could also use case analysis, but that would only work if our proof didn’t require “recursion” (we’ll get to this in a minute).  There are two cases as given by the syntax for n: when n is “0“, and when n is “s n1“, where n1 is some other number.  We must prove that s n > 0 for both of these cases individually, otherwise our proof is not valid.

The first case is easy.  When n is 0, the proof is trivial using the rule gt-one.  Notice that within this case we are no longer proving s n > 0, but rather s 0 > 0.  This is the huge win brought by SASyLF’s unification: n is0” within this case.  Anything we already know about n, we also know about 0.  When we apply the rule gt-one, SASyLF sees that we are attempting to prove s n > n where n is “0“.  This is valid by the rule, so the verification passes.

The second case is where things get interesting.  We have that n is actually s n1, but that doesn’t really get us too much closer to proving s s n1 > 0 (remember, unification).  Fortunately, we can prove that s n1 > 0 because we’re writing a lemma at this very moment which prove that.  This is like writing a function to sum all the values in a list: when the list is empty, the result is trivial; but when the list has contents, we must take the head and then add it to the sum of the tail as computed by…ourself.  Induction is literally just recursion in logic.  Interestingly enough, SASyLF is smart enough to look at all of the inductive cases in your proof and verify that they are valid.  This is sort-of the equivalent of a compiler looking at your code and telling you whether or not it will lead to an infinite loop.

To get that s n1 > 0, we use the induction hypothesis, passing n1 as the “parameter”.  However, we’re not quite done yet.  We need to prove that s s n1 > 0 in order to unify with our original target (s n > 0).  Fortunately, we already have a rule that allows us to prove the successor of a number retains its greater-than status: gt-more.

However, gt-more has a condition in our definition.  It requires that we already have some fact n1 > n2 in order to obtain s n1 > n2.  In our case, we already have this fact (s n1 > 0), but we need to “pass” it to the rule.  SASyLF allows us to do this by giving our facts labels.  In this case, we have labeled the s n1 > 0 fact as “g“.  We take this fact, pack it up and send it to gt-more and it gives us back our final goal.

A Slightly Harder Theorem

A slightly more difficult task would be to prove that the successors of two numbers preserves their greater-than relationship.  Thus, if we know that 4 > 3, we can prove that 5 > 4.  More formally:

theorem gt-implies-gt-succ:
    forall g: n1 > n2
    exists s n1 > s n2 .

    _: s n1 > s n2 by unproved
end theorem

At first glance, this looks impossible since we don’t really have a rule dealing with s n on the right-hand side of the >-sign.  We can try to prove this one step at a time to see whether or not this intuition is correct.

Almost any lemma of interest is going to require induction, so immediately we jump to inducting on the only fact we have available: g.  Note that this is different from what we had in the earlier example.  Instead of getting the different syntactic cases, we’re looking at the the rules which would have allowed the input to be constructed.  After all, whoever “called” our theorem will have needed to somehow prove that n1 > n2, it would be helpful to know what facts they used to do that.  SASyLF allows this using the case rule syntax.  We start with the easy base case:

_: s n1 > s n2 by induction on g:
    case rule
        ------------ gt-one
        _: s n2 > n2
    is
        _: s s n2 > s n2 by rule gt-one
    end case
end induction

In this case, the term _: s n2 > n2 is unified with n1 > n2.  Thus, n1 is actually “s n2“.  This means that by unification, we are actually trying to prove s s n2 > s n2.  Fortunately, we have a rule for that.  If we let “n” be “s n2“, we can easily apply the rule gt-one to produce the desired result.

The second case is a bit trickier.  We start out by defining the case rule according to the inference rules given in the judgment section.  The only case left is gt-more, so we mindlessly copy/paste and correct the variables to suit our needs:

case rule
    g1: n11 > n2
    ------------- gt-more
    _: s n11 > n2
is
    _: s s n11 > s n2 by unproved
end case

In this case, n1 actually unifies with “s n11“.  This is probably the most annoying aspect of SASyLF: all of the syntax is determined by token prefix, so every number has to start with n, occasionally making proofs a little difficult to follow.

At this point, we need to derive s s n11 > s n2.  Since the left and right side of the > “operator” do not share a common sub-term, the only rule which could possibly help us is gt-more.  In order to apply this rule, we will somehow need to derive s n11 > s n2 (remember, gt-more takes a known greater-than relationship and then tells us something about how the left-successor relates to the right).  We can reflect this “bottom-up” step towards a proof in the following way:

case rule
    g1: n11 > n2
    ------------- gt-more
    _: s n11 > n2
is
    g: s n11 > s n2 by unproved
    _: s s n11 > s n2 by rule gt-more on g
end case

At this point, SASyLF will warn us about the unproved, but it will happily pass the rest of our theorem.  This technique for proof development is extremely handy in more complicated theorems.  The ability to find out whether or not your logic is sound even before it is complete can be very reassuring (in this way you can avoid chasing entirely down the wrong logical path).

In order to make this whole thing work, we need to somehow prove s n11 > s n2.  Fortunately, we just so happen to be working on a theorem which could prove this if we could supply n11 > n2.  This fact is conveniently available with the label of “g1“.  We feed this into the induction hypothesis to achieve our goal.  The final theorem looks like this:

theorem gt-implies-gt-succ:
    forall g: n1 > n2
    exists s n1 > s n2 .

    _: s n1 > s n2 by induction on g:
        case rule
            ------------ gt-one
            _: s n2 > n2
        is
            _: s s n2 > s n2 by rule gt-one
        end case

        case rule
            g1: n11 > n2
            ------------- gt-more
            _: s n11 > n2
        is
            g2: s n11 > s n2 by induction hypothesis on g1
            _: s s n11 > s n2 by rule gt-more on g2
        end case
    end induction
end theorem

Conclusion

I realize this was a bit of a deviation from my normal semi-practical posts, but I think it was still a journey well worth taking.  If you’re working as a serious developer in this industry, I strongly suggest that you find yourself a good formal language and/or type theory textbook (might I recommend?) and follow it through the best that you can.  The understanding of how languages are formally constructed and the mental circuits to create those proofs yourself will have a surprisingly powerful impact on your day-to-day programming.  Knowing how the properties of a language are proven provides tremendous illumination into why that language is the way it is and somtimes how it can be made better.

Credit: Examples in this post drawn rather unimaginatively from Dr. John Boyland’s excellent course in type theory.

Pipe Dream: Static Analysis for Ruby

30
Jun
2008

Yes, yes I know: Ruby is a dynamic language.  The word “static” is literally opposed to everything the language stands for.  With that said, I think that even Ruby development environments would benefit from some simple static analysis, just enough to catch the really idiotic errors.

Here’s the crux of the problem: people don’t test well.  Even with nice, behavior-driven development as facilitated by frameworks like RSpec, very few developers sufficiently test their code.  This isn’t just a problem with dynamic languages either, no one is safe from the test disability.  In some ways, it’s a product of laziness, but I think in most cases, good developers just don’t want to work on mundane problems.  It’s boring having to write unit test after unit test, checking and re-checking the same snippet of code with different input.

In some sense, it is this problem that compilers and static type systems try to avert, at least partially.  The very purpose of a static type system is to be able to prove certain things about your code simply by analysis.  By enabling the compiler to say things using the type system, the language is providing a safety net which filters out ninety percent of the annoying “no-brainer” mistakes.  A simple example would be invoking a method with the wrong parameters; or worse yet, misspelling the name of the method or type altogether.

The problem is that there are some problems which are more simply expressed in ways which are not provably sound.  In static languages, we get around this by casting, but such techniques are ugly and obviously contrived.  It is this problem which has given rise to the kingdom of dynamic languages; it is for this reason that most scripting languages have dynamic type systems: simple expression of algorithm without worrying about provability.  In fact, there are so many problems which do not fit nicely within most type systems that many developers have chosen to eschew static languages altogether, claiming that static typing just gets in the way.

Unfortunately, by abandoning static types, these languages lose that typo safety net.  It’s too easy to make a trivial mistake in a dynamic language, buried somewhere deep in the bowls of your application.  This mistake could easily be averted by a compiler with validating semantic analysis, but in a dynamic language, such a mistake could go unnoticed, conceivably even making it into production.  For this reason, most dynamic language proponents are also strong advocates of solid, comprehensive testing.  They have to be, for without such testing, one should never trust dynamic code in a production system (or any code, for that matter, but especially the unchecked dynamic variety).

Most large, production systems written in languages like Ruby or Groovy have large test suites which sometimes take hours to run.  These suites are extremely fine-grained, optimally checking every line of code with every possible kind of input, so as to be sure that mistakes are caught.  This is where the flexibility of dynamic typing really comes back to haunt you: extra testing is required to ensure that silly mistakes don’t slip through.  The irony is that a lot of developers using dynamic languages do so to get away from the “nuisance” of compilation, when all they have done is trade one inconvenience for another (testing).

Given this situation, it’s not unreasonable to conclude that what dynamic languages really need is a tool which can look through code and find all of those brain-dead mistakes.  Such a tool could be run along with the normal test suite, finding and reporting errors in much the same way.  It wouldn’t really have to be a compiler, so the tool wouldn’t slow down the development process, it would just be an effective layer of automated white-box testing.

But how could such a thing be accomplished in a language like Ruby?  After all, it is a truly dynamic language.  Methods don’t even exist until runtime, and sometimes only if certain code paths are run.  Types are completely undeclared, and every object can potentially respond to any method.  The answer is to perform extremely permissive inference.

It was actually a recent post by James Ervin on the nomenclature of type systems which got me thinking along these lines.  It should be possible by static analysis to infer the structural type of any value based on its usage.  Consider:

def do_something(obj)
  if obj.to_i == 0
    obj[:test]
  else
    other = obj.find :name => 'Daniel'
    other.to_s
  end
end

Just by examining this code, we can say certain things about the types involved.  For instance, we know that obj must respond to the following methods:

  • to_i
  • [Symbol]
  • find(Hash)

In turn, we know that the find(Hash) method must return a value which defines to_s.  Of course, this last bit of information isn’t very useful, because every object defines that method, but it’s still worth the inference.  The really useful inference which comes out of to_s is the knowledge that this method sometimes returns a value of type String (making the assumption that to_s hasn’t been redefined to return a different type, which isn’t exactly a safe assumption).  At other times, do_something will return whatever value comes from the square bracket operator ([]) on obj.  This bit of information we must remember in the analysis.  We can’t just assume that this method will return a String all the time, even if to_s does because method return types need not be homogeneous in dynamic languages.

Now, at this point we have effectively built up a structural type which is accepted by do_something.  Literally, we have formalized in the analysis what our intuition has already told us about the method.  There are some gaps, but that is to be expected.  The key to this analysis is not attempting to be comprehensive.  Dynamic languages cannot be analyzed as if they were static, one must expect to have certain limitations.  In such situations where the analysis is insufficient, it must assume that the code is valid, otherwise there will be thousands of false positives in the error checking.

So what is it all good for?  Well, imagine that somewhere else in our application, we have the following bit of code:

do_something 42

This is something we know will fail, because we have a simple value (42) which has a nominal type we can easily infer.  A little bit of checking on this type reveals the fact that it does not define square brackets, nor does it define a find(Hash) method.  This finding could be reported as an error by the analysis engine.

Granted, we still have to account for the fact that Ruby has things like method_missing and open classes, but all of this can fall into the fuzzy area of the analysis.  In situations where it might be alright to pass an object which does not satisfy a certain aspect of the structural type, the analysis must let it pass without question.

You can imagine how this analysis could traverse the entire source tree, making the strictest inferences it can and allowing for dynamic fuzziness where applicable.  Since the full sources of every Ruby function, class and module are available at runtime, analysis could be performed without any undue concern regarding obfuscation or parsing of binaries.  Conceivably, most trivial errors could be caught without any tests being written, taking some of the burden off of the developer.  There is a slight concern that developers would build up a false sense of security regarding their testing (or lack thereof), but I think we just have to trust that won’t happen, or won’t last long if it does.

Most advanced Ruby toolsets already have an analysis somewhat similar to the one I outlined.  NetBeans Ruby for example has some fairly advanced nominal type inference to allow things like semantic highlighting and content assist.  But as far as I know, this type inference is only nominal, and fairly local at that.  The structural type inference that I am proposing could conceivably provide far better assurances and capabilities than mere nominal inference, especially if enhanced through successive iteration and a more “global” approach (similar to Hindley/Milner in static languages).

One thing is certain, it isn’t working to just rely on developers being conscientious with their testing.  With the rapid rise in production systems running on dynamic languages, it is in all of our best interests to try to find a way to make these systems more stable and reliable.  The best way to do this is to start with code assurance and try to make it a little less painful to catch mistakes before deployment.