Skip to content

Pipe Dream: Static Analysis for Ruby

30
Jun
2008

Yes, yes I know: Ruby is a dynamic language.  The word “static” is literally opposed to everything the language stands for.  With that said, I think that even Ruby development environments would benefit from some simple static analysis, just enough to catch the really idiotic errors.

Here’s the crux of the problem: people don’t test well.  Even with nice, behavior-driven development as facilitated by frameworks like RSpec, very few developers sufficiently test their code.  This isn’t just a problem with dynamic languages either, no one is safe from the test disability.  In some ways, it’s a product of laziness, but I think in most cases, good developers just don’t want to work on mundane problems.  It’s boring having to write unit test after unit test, checking and re-checking the same snippet of code with different input.

In some sense, it is this problem that compilers and static type systems try to avert, at least partially.  The very purpose of a static type system is to be able to prove certain things about your code simply by analysis.  By enabling the compiler to say things using the type system, the language is providing a safety net which filters out ninety percent of the annoying “no-brainer” mistakes.  A simple example would be invoking a method with the wrong parameters; or worse yet, misspelling the name of the method or type altogether.

The problem is that there are some problems which are more simply expressed in ways which are not provably sound.  In static languages, we get around this by casting, but such techniques are ugly and obviously contrived.  It is this problem which has given rise to the kingdom of dynamic languages; it is for this reason that most scripting languages have dynamic type systems: simple expression of algorithm without worrying about provability.  In fact, there are so many problems which do not fit nicely within most type systems that many developers have chosen to eschew static languages altogether, claiming that static typing just gets in the way.

Unfortunately, by abandoning static types, these languages lose that typo safety net.  It’s too easy to make a trivial mistake in a dynamic language, buried somewhere deep in the bowls of your application.  This mistake could easily be averted by a compiler with validating semantic analysis, but in a dynamic language, such a mistake could go unnoticed, conceivably even making it into production.  For this reason, most dynamic language proponents are also strong advocates of solid, comprehensive testing.  They have to be, for without such testing, one should never trust dynamic code in a production system (or any code, for that matter, but especially the unchecked dynamic variety).

Most large, production systems written in languages like Ruby or Groovy have large test suites which sometimes take hours to run.  These suites are extremely fine-grained, optimally checking every line of code with every possible kind of input, so as to be sure that mistakes are caught.  This is where the flexibility of dynamic typing really comes back to haunt you: extra testing is required to ensure that silly mistakes don’t slip through.  The irony is that a lot of developers using dynamic languages do so to get away from the “nuisance” of compilation, when all they have done is trade one inconvenience for another (testing).

Given this situation, it’s not unreasonable to conclude that what dynamic languages really need is a tool which can look through code and find all of those brain-dead mistakes.  Such a tool could be run along with the normal test suite, finding and reporting errors in much the same way.  It wouldn’t really have to be a compiler, so the tool wouldn’t slow down the development process, it would just be an effective layer of automated white-box testing.

But how could such a thing be accomplished in a language like Ruby?  After all, it is a truly dynamic language.  Methods don’t even exist until runtime, and sometimes only if certain code paths are run.  Types are completely undeclared, and every object can potentially respond to any method.  The answer is to perform extremely permissive inference.

It was actually a recent post by James Ervin on the nomenclature of type systems which got me thinking along these lines.  It should be possible by static analysis to infer the structural type of any value based on its usage.  Consider:

def do_something(obj)
  if obj.to_i == 0
    obj[:test]
  else
    other = obj.find :name => 'Daniel'
    other.to_s
  end
end

Just by examining this code, we can say certain things about the types involved.  For instance, we know that obj must respond to the following methods:

  • to_i
  • [Symbol]
  • find(Hash)

In turn, we know that the find(Hash) method must return a value which defines to_s.  Of course, this last bit of information isn’t very useful, because every object defines that method, but it’s still worth the inference.  The really useful inference which comes out of to_s is the knowledge that this method sometimes returns a value of type String (making the assumption that to_s hasn’t been redefined to return a different type, which isn’t exactly a safe assumption).  At other times, do_something will return whatever value comes from the square bracket operator ([]) on obj.  This bit of information we must remember in the analysis.  We can’t just assume that this method will return a String all the time, even if to_s does because method return types need not be homogeneous in dynamic languages.

Now, at this point we have effectively built up a structural type which is accepted by do_something.  Literally, we have formalized in the analysis what our intuition has already told us about the method.  There are some gaps, but that is to be expected.  The key to this analysis is not attempting to be comprehensive.  Dynamic languages cannot be analyzed as if they were static, one must expect to have certain limitations.  In such situations where the analysis is insufficient, it must assume that the code is valid, otherwise there will be thousands of false positives in the error checking.

So what is it all good for?  Well, imagine that somewhere else in our application, we have the following bit of code:

do_something 42

This is something we know will fail, because we have a simple value (42) which has a nominal type we can easily infer.  A little bit of checking on this type reveals the fact that it does not define square brackets, nor does it define a find(Hash) method.  This finding could be reported as an error by the analysis engine.

Granted, we still have to account for the fact that Ruby has things like method_missing and open classes, but all of this can fall into the fuzzy area of the analysis.  In situations where it might be alright to pass an object which does not satisfy a certain aspect of the structural type, the analysis must let it pass without question.

You can imagine how this analysis could traverse the entire source tree, making the strictest inferences it can and allowing for dynamic fuzziness where applicable.  Since the full sources of every Ruby function, class and module are available at runtime, analysis could be performed without any undue concern regarding obfuscation or parsing of binaries.  Conceivably, most trivial errors could be caught without any tests being written, taking some of the burden off of the developer.  There is a slight concern that developers would build up a false sense of security regarding their testing (or lack thereof), but I think we just have to trust that won’t happen, or won’t last long if it does.

Most advanced Ruby toolsets already have an analysis somewhat similar to the one I outlined.  NetBeans Ruby for example has some fairly advanced nominal type inference to allow things like semantic highlighting and content assist.  But as far as I know, this type inference is only nominal, and fairly local at that.  The structural type inference that I am proposing could conceivably provide far better assurances and capabilities than mere nominal inference, especially if enhanced through successive iteration and a more “global” approach (similar to Hindley/Milner in static languages).

One thing is certain, it isn’t working to just rely on developers being conscientious with their testing.  With the rapid rise in production systems running on dynamic languages, it is in all of our best interests to try to find a way to make these systems more stable and reliable.  The best way to do this is to start with code assurance and try to make it a little less painful to catch mistakes before deployment.

The Need for a Common Compiler Framework

23
Jun
2008

In recent years, we have seen a dramatic rise in the number of languages used in mainstream projects.  In particular, languages which run on the JVM or CLR have become quite popular (probably because sane people hate dealing with x86 assembly).  Naturally, such languages prefer to interoperate with other languages built on these core platforms, particularly Java and C# (respectively).  Collectively, years of effort have been put into devising and implementing better ways of working with libraries written in these “parent languages”.  The problem is that such efforts are crippled by one fundamental limitation: circular dependencies.

Let’s take Scala as an example.  Of all of the JVM languages, this one probably has the potential for the tightest integration with Java.  Even Groovy, which is renowned for its integration, still falls short in many key areas.  (generics, anyone?)  With Scala, every class is a Java class, every method is a Java method, and there is no API which cannot be accessed from Java as natively as any other.  For example, I can write a simple linked list implementation in Scala and then use it in Java without any fuss whatsoever (warning: untested sample):

class LinkedList[T] {
  private var root: Node = _
 
  def add(data: T) = {
    val insert = Node(data, null)
 
    if (root == null) {
      root = insert
    } else {
      root.next = insert
    }
 
    this
  }
 
  def get(index: Int) = {
    def walk(node: Node, current: Int): T = {
      if (node == null) {
        throw new IndexOutOfBoundsException(index.toString)
      }
 
      if (current < index) {
        walk(node.next, current + 1)
      } else {
        node.data
      }
    }
 
    if (index < 0) {
      throw new IndexOutOfBoundsException(index.toString)
    }
 
    walk(root, 0)
  }
 
  def size = {
    def walk(node: Node): Int = if (node == null) 0 else 1 + walk(node.next)
 
    walk(root)
  }
 
  private case class Node(data: T, var next: Node)
}

Once this class is compiled, we can use it in our Java code just as if it were written within the language itself:

public class Driver {
    public static void main(String[] args) {
        LinkedList<String> list = new LinkedList<String>();
 
        for (String arg : args) {
            list.add(arg);
        }
 
        System.out.println("List has size: " + list.size());
 
        for (int i = 0; i < list.size(); i++) {
            System.out.println(list.get(i).trim());
        }
    }
}

Impressively seamless interoperability!  We actually could have gotten really fancy and thrown in some operator overloading.  Obviously, Java wouldn’t have been able to use the operators themselves, but it still would have been able to call them just like normal Java instance methods.  Using Scala in this way, we can get all the advantages of its concise syntax and slick design without really abandoning our Java code base.

The problem comes in when we try to satisfy more complex cases.  Groovy proponents often trot out the example of a Java class inherited by a Groovy class which is in turn inherited by another Java class.  In Scala, that would be doing something like this:

public class Shape {
    public abstract void draw(Canvas c);
}
class Rectangle(val width: Int, val height: Int) extends Shape {
  override def draw(c: Canvas) {
    // ...
  }
}
public class Square extends Rectangle {
    public Square(int size) {
        super(size, size);
    }
}

Unfortunately, this isn’t exactly possible in Scala.  Well, I take that back.  We can cheat a bit and first compile Shape using javac, then compile Rectangle using scalac and finally Square using javac, but that would be quite nasty indeed.  What’s worse is such a technique would completely fall over if the Canvas class were to have a dependency on Rectangle, something which isn’t too hard to imagine.  In short, Scala is bound by the limitations of a separate compiler, as are most languages on the JVM.

Groovy solves this problem by building their own Java compiler into groovyc, thus allowing the compilation of both Java and Groovy sources within the same process.  This solves the problem of circular references because neither set of sources is completely compiled before the other.  It’s a nice solution, and one which Scala will be adopting in an upcoming release of its compiler.  However, it doesn’t really solve everything.

Consider a more complex scenario.  Imagine we have Java class Shape, which is extended by Scala class Rectangle and Groovy class Circle.  Imagine also that class Canvas has a dependency on both Rectangle and Circle, perhaps for some special graphics optimizations.  Suddenly we have a three-way circular dependency and no way of resolving it without a compiler which can handle all three languages: Java, Groovy and Scala.  This is starting to become a bit more interesting.

Of course, we can solve this problem in the same way we solved the Groovy-Java dependence problem: just add support to the compiler!  Unfortunately, it may have been trivial to implement a Java compiler as part of groovyc, but Scala is a much more difficult language from a compiler’s point of view.  But even supposing that we do create an integrated Scala compiler, we still haven’t solved the problem.  It’s not difficult to imagine throwing another language into the mix; Clojure, for example.  Do we keep going, tacking languages onto our once-Groovy compiler until we support everything usable on the JVM?  It should be obvious why this is a bad plan.

A more viable solution would be to create a common compiler framework, one which would be used as the basis for all JVM languages.  This framework would have common abstractions for things like name resolution and type checking.  Instead of creating an entire compiler from scratch, every language would simply extend this core framework and implement their own language as some sort of module.  In this way, it would be easy to build up a custom set of modules which solve the needs of your project.  Since the compilers are modular and based on the same core framework, they would be able to handle simultaneous compilation of all JVM languages involved, effectively solving the circular dependency problem in a generalized fashion.

The framework could even make things easier on would-be compiler implementors by handling common operations like bytecode emission.  Fundamentally, all of these tightly-integrated languages are just different front-ends to a common backend: the JVM.  I haven’t looked at the sources, but I would imagine that there is a lot of work which had to be done in each compiler to solve problems which were already handled in another.

Of course, all this is purely speculative.  Everyone builds their compiler in a slightly different way (slightly => radically in the case of languages like Scala) and I wouldn’t imagine that it would be easy to build this sort of common compiler backend.  However, the technology is in place.  We already have nice module systems like OSGi, and we’re certainly no strangers to the work involved in building up a proper CLASSPATH for a given project.  Why should this be any different?

It’s not without precedent either.  GCC defines a common backend for a number of compilers, such as G++, GCJ and even an Objective-C compiler.  Granted, it’s neither as high-level nor as modular as we would need to solve circular dependencies, but it’s something to go on.

It will be interesting to see where the JVM language sphere is headed next.  The rapid emergence of so many new languages is leading to problems which will have to be addressed before the polyglot methodology will be truly accepted by the industry.  Some of the smartest people in the development community are working toward solutions; and whether they take my idea of a modular framework or not, somewhere along the line the problem of simultaneous compilation must be solved.

Formal Language Processing in Scala

16
Jun
2008

Quite some time ago, a smart cookie named Phillip Wadler authored a publication explaining the concept of “parser combinators”, a method of representing the well-understood concept of text parsing as the composition of atomic constructs which behaved according to monadic law.  This idea understandably captured the imaginations of a number of leading researchers, eventually developing into the Haskell parsec library.  Being a functional language with academic roots, it is understandable that Scala would have an implementation of the combinator concept included in its standard library.

The inclusion of this framework into Scala’s core library has brought advanced text parsing to within the reach of the “common man” perhaps for the first time.  Of course, tools like Bison, ANTLR and JavaCC have been around for a long time and gained quite a following, but such tools often have a steep learning curve and are quite intimidating for the casual experimenter.  Parser combinators are often much easier to work with and can streamline the transition from “experimenter” to “compiler hacker”.

Of course, all of this functional goodness does come at a price: flexibility.  The Scala implementation of parser combinators rules out any left-recursive grammars.  Right-recursiveness is supported (thanks to call-by-name parameters), but any production rule which is left-recursive creates an infinitely recursive call chain and overflows the stack.  It is possible to overcome this limitation with right-associative operators, but even if such an implementation existed in Scala, it wouldn’t do any good.  Scala’s parser combinators effectively produce an LL(*) parser instead of the far more flexible LR or even better, LALR such as would be produced by Bison or SableCC.  It is unknown to me (and everyone I’ve asked) whether or not it is even possible to produce an LR parser combinitorially, though the problem of left-recursion in LL(*) has been studied at length.

The good news is that you really don’t need all of the flexibility of an LR parser for most cases.  (in fact, you don’t theoretically need LR at all, it’s just easier for a lot of things)  Parser combinators are capable of satisfying many of the common parsing scenarios which face the average developer.  Unless you’re planning on building the next Java-killing scripting language, the framework should be just fine.

Of course, any time you talk about language parsing, the topic of language analysis and interpretation is bound to come up.  This article explores the construction of a simple interpreter for a trivial language.  I chose to create an interpreter rather than a compiler mainly for simplicity (I didn’t want to deal with bytecode generation in a blog post).  I also steer clear of a lot of the more knotty issues associated with semantic analysis, such as type checking, object-orientation, stack frames and the like.To be honest, I’ve been looking forward to writing this article ever since I read Debasish Ghosh’s excellent introduction to external DSL implementation in Scala.  Scala’s parser combinators are dreadfully under-documented, especially when you discount purely academic publications.  Hopefully, this article will help to rectify the situation in some small way.

Simpletalk: Iteration 1

Before we implement the interpreter, it is usually nice to have a language to interpret.  For the purposes of this article, we will be using an extremely contrived language based around the standard output (hence the name: “simple” “talk”).  The language isn’t complete (even in the later iterations) and you would be hard-pressed to find any useful application; but, it makes for a convenient running example to follow.

In the first iteration, Simpletalk is based around two commands: print and space.  Two hard-coded messages are available for output via print: HELLO and GOODBYE.  We will increase the complexity of the language later, allowing for literals and alternative constructs, but for the moment, this will suffice.  An example Simpletalk program which exercises all language features could be as follows:

print HELLO
space
space
print GOODBYE
space
print HELLO
print GOODBYE

The output would be the following:

Hello, World!

Farewell, sweet petunia!

Hello, World!
Farewell, sweet petunia!

As I said, not very useful.

The first thing we need to do is to define a context-free grammar using the combinator library included with Scala.  The language itself is simple enough that we could write the parser by hand, but it is easier to extend a language with a declarative definition.  Besides, I wanted an article on using parser combinators to build an interpreter…

object Simpletalk extends StandardTokenParsers with Application {
  lexical.reserved += ("print", "space", "HELLO", "GOODBYE")
 
  val input = Source.fromFile("input.talk").getLines.reduceLeft[String](_ + '\n' + _)
  val tokens = new lexical.Scanner(input)
 
  val result = phrase(program)(tokens)
 
  // grammar starts here
  def program = stmt+
 
  def stmt = ( "print" ~ greeting
             | "space" )
 
  def greeting = ( "HELLO"
                 | "GOODBYE" )
}

All of this is fairly standard EBNF notation.  The one critical bit of syntax here is the tilde (~) operator, shown separating the "print" and greeting tokens.  This method is the concatenation operator for the parsers.  Literally it means: first "print", then parse greeting, whatever that entails.  What would normally be terminals in a context-free grammar are represented by full Scala methods.  Type inference makes the syntax extremely concise.

It’s also worth noticing the use of the unary plus operator (+) to specify a repetition with one or more occurrences.  This is standard EBNF and implemented as perfectly valid Scala within the combinator library.  We could have also used the rep1(...) method, but I prefer the operator simply because it is more notational.

Looking back up toward the top of the Simpletalk singleton, we see the use of lexical.reserved.  We must define the keywords used by our language to avoid the parser marking them as identifiers.  To save time and effort, we’re going to implement the interpreter by extending the StandardTokenParsers class.  This is nice because we get a lot of functionality for free (such as parsing of string literals), but we also have to make sure that our language is relatively Scala-like in its syntax.  In this case, that is not a problem.

Down the line, we initialize the scanner and use it to parse a result from the hard-coded file, input.talk.  It would seem that even if we wanted to use Simpletalk for some useful application, we would have to ensure that the entire script was contained in a single, rigidly defined file relative to the interpreter.

Defining the AST

Once the grammar has been created, we must create the classes which will define the abstract syntax tree.  Almost any language defines a grammar which can be logically represented as a tree.  This tree structure is desirable as it is far easier to work with than the raw token stream.  The tree nodes may be manipulated at a high-level in the interpreter, allowing us to implement advanced features like name resolution (which we handle in iteration 3).

At the root of our AST is the statement, or rather, a List of statements.  Currently, the only statements we need to be concerned with are print and space, so we will only need to define two classes to represent them in the AST.  These classes will extend the abstract superclass, Statement.

The Space class will be fairly simple, as the command requires no arguments.  However, Print will need to contain some high-level representation of its greeting.  Since there are only two greetings available and as they are hard-coded into the language, we can safely represent them with separate classes extending a common superclass, Greeting.  The full hierarchy looks like this:

sealed abstract class Statement
 
case class Print(greeting: Greeting) extends Statement
case class Space extends Statement
 
sealed abstract class Greeting {
  val text: String
}
 
case class Hello extends Greeting {
  override val text = "Hello, World!"
}
 
case class Goodbye extends Greeting {
  override val text = "Farewell, sweet petunia!"
}

Believe it or not, this is all we need to represent the language as it stands in tree form.  However, the combinator library does not simply look through the classpath and guess which classes might represent AST nodes, we must explicitly tell it how to convert the results of each parse into a node.  This is done using the ^^ and ^^^ methods:

def program = stmt+
 
def stmt = ( "print" ~ greeting ^^ { case _ ~ g => Print(g) }
           | "space" ^^^ Space() )
 
def greeting = ( "HELLO" ^^^ Hello()
               | "GOODBYE" ^^^ Goodbye() )

The ^^^ method takes a parameter as a literal value which will be returned if the parse is successful.  Thus, any time the space command is parsed, the result will be an instance of the Space class, defined here.  This is nicely compact and efficient, but it does not satisfy all cases.  The print command, for example, takes a greeting argument.  To allow for this, we use the ^^ method and pass it a Scala PartialFunction.  The partial function defines a pattern which is matched against the parse result.  If it is successful, then the inner expression is resolved (in this case, Print(g)) and the result is returned.  Since the Parser defined by the greeting method is already defined to return an instance of Greeting, we can safely pass the result of this parse as a parameter to the Print constructor.  Note that we need not define any node initialization for the program terminal as the + operator is already defined to return a list of whatever type it encapsulates (in this case, Statement).

The Interpreter

So far, we have been focused exclusively on the front side of the interpreter: input parsing.  Our parser is now capable of consuming and checking the textual statements from input.talk and producing a corresponding AST.  We must now write the code which walks the AST and executes each statement in turn.  The result is a fairly straightforward recursive deconstruction of a list, with each node corresponding to an invocation of println.

class Interpreter(tree: List[Statement]) {
  def run() {
    walkTree(tree)
  }
 
  private def walkTree(tree: List[Statement]) {
    tree match {
      case Print(greeting) :: rest => {
        println(greeting.text)
        walkTree(rest)
      }
 
      case Space() :: rest => {
        println()
        walkTree(rest)
      }
 
      case Nil => ()
    }
  }
}

This is where all that work we did constructing the AST begins to pay off.  We don’t even have to manually resolve the greeting constants, that can be handled polymorphically within the nodes themselves.  Actually, in a real interpreter, it probably would be best to let the statement nodes handle the actual execution logic, thus enabling the interpreter to merely function as a dispatch core, remaining relatively agnostic of language semantics.

The final piece we need to tie this all together is a bit of logic handling the parse result (assuming it is successful) and transferring it to the interpreter.  We can accomplish this using pattern matching in the Simpletalk application:

result match {
  case Success(tree, _) => new Interpreter(tree).run()
 
  case e: NoSuccess => {
    Console.err.println(e)
    exit(100)
  }
}

We could get a bit fancier with our error handling, but in this case it is easiest just to print the error and give up.  The error handling we have is sufficient for our experimental needs.  We can test this with the following input:

print errorHere
space

The result is the following TeX-like trace:

[1.7] failure: ``GOODBYE'' expected but identifier errorHere found

print errorHere

      ^

One of the advantages of LL parsing is the parser can automatically generate relatively accurate error messages, just by inspecting the grammar and comparing it to the input.  This is much more difficult with an LR parser, which allows the parse process to consume multiple production rules simultaneously.

Pat yourself on the back!  This is all that is required to wire up a very simple language interpreter.  The full source is linked at the bottom of the article.

Iteration 2

Now that we have a working language implementation, it’s time to expand upon it.  After all, we can’t leave things working for long, can we?  For iteration 2, we will add the ability to print arbitrary messages using string and numeric literals, as well as a simple loop construct.  This loop will demonstrate some of the real merits of representing the program as a tree rather than a simple token stream.  As for syntax, we will define an example of these features as follows:

print HELLO
print 42

space
repeat 10
  print GOODBYE
next

space
print "Adios!"

The result should be the following:

Hello, World!
42

Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!

Adios!

The first thing we will need to do to support these new features is update our grammar.  This is where we start to see the advantages of using a declarative API like Scala’s combinators as opposed to creating the parser by hand.  We will need to change stmt terminal (to accept the repeat command) as well as the greeting terminal (to allow string and numeric literals):

def stmt: Parser[Statement] = ( "print" ~ greeting ^^ { case _ ~ g => Print(g) }
                              | "space" ^^^ Space()
                              | "repeat" ~ numericLit ~ (stmt+) ~ "next" ^^ {
                                  case _ ~ times ~ stmts ~ _ => Repeat(times.toInt, stmts)
                                } )
 
def greeting = ( "HELLO" ^^^ Hello()
               | "GOODBYE" ^^^ Goodbye()
               | stringLit ^^ { case s => Literal(s) }
               | numericLit ^^ { case s => Literal(s) } )

Notice that we can no longer rely upon type inference in the stmt method, as it is now recursive.  This recursion is in the repeat rule, which contains a one-or-more repetition of stmt.  This is logical since we want repeat to contain other statements, including other instances of repeat.  The repeat rule also makes use of the numericLit terminal.  This is a rule which is defined for us as part of the StandardTokenParsers.  Technically, it is more accepting than we want since it will also allow decimals.  However, we don’t need to worry about such trivialities.  After all, this is just an experiment, right?

The numericLit and stringLit terminals are used again in two of the productions for greeting.  Both of these parsers resolve to instances of String, which we pattern match and encapsulate within a new AST node: Literal.  It makes sense for numericLit to resolve to String because Scala has no way of knowing how our specific language will handle numbers.

These are the new AST classes required to satisfy the language changes:

case class Repeat(times: Int, stmts: List[Statement]) extends Statement
 
case class Literal(override val text: String) extends Greeting

Literal merely needs to encapsulate a String resolved directly out of the parse, there is no processing required.  Repeat, on the other hand, has a bit more interest to it.  This class contains a list of Statement(s), as well as an iteration count, which will define the behavior of the repeat when executed.  This is our first example of a truly recursive AST structure. Repeat is defined as a subclass of Statement, and it contains a List of such Statement(s).  Thus, it is conceivable that a Repeat could contain another instance of Repeat, nested within its structure.  This is really the true power of the AST: the ability to represent a recursive grammar in a logical, high-level structure.

Of course, Interpreter also must be modified to support these new features; but because of our polymorphic Greeting design, we only need to worry about Repeat.  This node is easily handled by adding another pattern to our recursive match:

case Repeat(times, stmts) :: rest => {
  for (i <- 0 until times) {
    walkTree(stmts)
  }
 
  walkTree(rest)
}

Here we see the primary advantage to leaving the execution processing within Interpreter rather than polymorphically farming it out to the AST: direct access to the walkTree method.  Logically, each repeat statement contains a new Simpletalk program within itself.  Since we already have a method defined to interpret such programs, it only makes sense to use it!  The looping itself can be handled by a simple for-comprehension.  Following the loop, we deconstruct the list and move on to the next statement in the enclosing scope (which could be a loop itself).  This design is extremely flexible and capable of handling the fully recursive nature of our new language.

The only other change we need to make is to add our new keywords to the lexer, so that they are not parsed as identifiers:

lexical.reserved += ("print", "space", "repeat", "next", "HELLO", "GOODBYE")

Iteration 3

It’s time to move onto something moderately advanced.  So far, we’ve stuck to easy modifications like new structures and extra keywords.  A more complicated task would be the addition of variables and scoping.  For example, we might want a syntax something like this:

let y = HELLO

space
print y
let x = 42
print x

space
repeat 10
  let y = GOODBYE
  print y
next

space
print y

space
print "Adios!"

And the result:

Hello, World!
42

Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!

Hello, World!

Adios!

This is significantly more complicated than our previous iterations partially because it requires name resolution (and thus, some semantic analysis), but more importantly because it requires a generalization of expressions.  We already have expressions of a sort in the Greeting nodes.  These are not really general-purpose expressions as they do not resolve to a value which can be used in multiple contexts.  This was not previously a problem since we only had one context in which they could be resolved (print).  But now, we will need to resolve Greeting(s) for both print and let statements.

We will start with the AST this time.  We need to modify our Greeting superclass to allow for more complex resolutions than a static text value.  More than that, these resolutions no longer take place in isolation, but within a variable context (referencing environment).  This context will not be required for Literal, Hello or Goodbye expressions, but it will be essential to handle our new AST: Variable.

sealed abstract class Statement
 
case class Print(expr: Expression) extends Statement
case class Space extends Statement
 
case class Repeat(times: Int, stmts: List[Statement]) extends Statement
case class Let(val id: String, val expr: Expression) extends Statement
 
sealed abstract class Expression {
  def value(context: Context): String
}
 
case class Literal(text: String) extends Expression {
  override def value(context: Context) = text
}
 
case class Variable(id: String) extends Expression {
  override def value(context: Context) = {
    context.resolve(id) match {
      case Some(binding) => binding.expr.value(context)
      case None => throw new RuntimeException("Unknown identifier: " + id)
    }
  }
}
 
case class Hello extends Expression {
  override def value(context: Context) = "Hello, World!"
}
 
case class Goodbye extends Expression {
  override def value(context: Context) = "Farewell, sweet petunia!"
}

Notice that the Print and Let nodes both accept instances of Expression, rather than the old Greeting node.  This generalization is quite powerful, allowing variables to be assigned the result of other variables, literals or greetings.  Likewise, the print statement may also be used with variables, literals or greetings alike.

Actually, the real meat of the implementation is contained within the Context class (not shown).  This data structure will manage the gritty details of resolving variable names into let-bindings (which can then be resolved as expressions).  Additionally, Context must deal with all of the problems associated with nested scopes and name shadowing (remember that our motivation example shadows the definition of the y variable).

For the moment, we need not concern ourselves with reassignment (mutability).  Redefinition will be allowed, but simplicity in both the grammar and the interpreter calls for fully immutable constants, rather than true variables.  Additionally, this restriction allows our interpreter to easily make use of fully immutable data structures in its implementation.  Scala doesn’t really impose this as an implementation requirement, but it’s neat to be able to do.  Also, the pure-functional nature of the data structures provide greater assurance as to the correctness of the algorithm.

Here is the full implementation of both Context and the modified Interpreter:

class Interpreter(tree: List[Statement]) {
  def run() {
    walkTree(tree, EmptyContext)
  }
 
  private def walkTree(tree: List[Statement], context: Context) {
    tree match {
      case Print(expr) :: rest => {
        println(expr.value(context))
        walkTree(rest, context)
      }
 
      case Space() :: rest => {
        println()
        walkTree(rest, context)
      }
 
      case Repeat(times, stmts) :: rest => {
        for (i <- 0 until times) {
          walkTree(stmts, context.child)
        }
 
        walkTree(rest, context)
      }
 
      case (binding: Let) :: rest => walkTree(rest, context + binding)
 
      case Nil => ()
    }
  }
}
 
class Context(ids: Map[String, Let], parent: Option[Context]) {
  lazy val child = new Context(Map[String, Let](), Some(this))
 
  def +(binding: Let) = {
    val newIDs = ids + (binding.id -> binding)
    new Context(newIDs, parent)
  }
 
  def resolve(id: String): Option[Let] = {
    if (ids contains id) {
      Some(ids(id))
    } else {
      parent match {
        case Some(c) => c resolve id
        case None => None
      }
    }
  }
}
 
object EmptyContext extends Context(Map[String, Let](), None)

Context is really little more than a thin wrapper around an immutable Map from String to Let.  The resolve method attempts to resolve a given identifier within the local context.  If that fails, it recursively attempts resolution on the parent context, to which it has a reference.  This procedes all the way up to the root context of the program, which has no parent.  If a resolution still fails at this point, None is returned and an error is thrown from within Interpreter.

The only other method of significance within Context is the + operator, which is the means by which new bindings are added to the context.  More specifically, a new context is constructed from the old in which the new binding is present (remember, immutable data).  The child value represents a new Context which is nested within the current (such as within a repeat block).  This can be a constant because of the immutability of the structure.  Thus, a single context may contain many nested contexts; but if none of them contain new bindings, then they will all be the exact same instance.

Jumping up to Interpreter, we see how Context is actually used.  Because our language lacks forward-referencing, we don’t have to worry about multiple passes through the AST.  This simplifies things tremendously, enabling us to handle semantic analysis and interpretation in a single effective phase.  As we walk through the AST, we carry the current Context along with us.  Whenever we come across a new binding (a let statement), we create a new Context with the binding and recursively pass it to the next node interpretation.  Whenever we come to the definition of a new context (in this case, a repeat statement), the child value of the current context is retrieved and passed to the interpretation of all nested statements.  This Context is discarded once we resume interpretation of the current level.

The final implementation step is to handle the grammar for our new language features.  This is easily done by modifying our existing combinitor definitions.  Additionally, we must add a new keyword (”let”) to the list of reserved identifiers, as well as the equals sign (”=”) to the list of token delimiters.

lexical.reserved += ("print", "space", "repeat", "next", "let", "HELLO", "GOODBYE")
lexical.delimiters += ("=")
 
// ...
 
def program = stmt+
 
def stmt: Parser[Statement] = ( "print" ~ expr ^^ { case _ ~ e => Print(e) }
                              | "space" ^^^ Space()
                              | "repeat" ~ numericLit ~ (stmt+) ~ "next" ^^ {
                                  case _ ~ times ~ stmts ~ _ => Repeat(times.toInt, stmts)
                                } 
                              | "let" ~ ident ~ "=" ~ expr ^^ { 
                                  case _ ~ id ~ _ ~ e => Let(id, e) 
                                } )
 
def expr = ( "HELLO" ^^^ Hello()
           | "GOODBYE" ^^^ Goodbye()
           | stringLit ^^ { case s => Literal(s) }
           | numericLit ^^ { case s => Literal(s) }
           | ident ^^ { case id => Variable(id) } )

Once again, the StandardTokenParsers gives us a great deal of functionality for free.  In this case, we make use of the ident terminal, which represents a Scala-style identifier in our language.  This token occurs in two places: the let statement and as a raw expression.  Just like stringLit and numericLit, ident is resolved down to a String instance, which we store in Let and Variable nodes for future use.

Conclusion

This concludes my introduction to the subtle and fascinating world of language processing.  In truth, I’m rather upset that I had to leave so many things out of the article and that a lot of my explanations were rushed at best.  There’s just so much information to fit into so little space!  I strongly suggest that you experiment further with the techniques associated with interpreters and (more interestingly) compilers.  These studies have helped me immeasurably as a programmer, imparting a much deeper understanding of what goes on “under the surface” and why things are the way they are in languages.

Buildr Still Not Ready for Prime Time

12
May
2008

As some of you may know, I’ve been following the Buildr project with a fair degree of interest of late.  Just last week, Assaf announced their first release as an Apache project: Buildr 1.3.0.  Being of the inquisitive bend, I decided that this would be an excellent time to reevaluate its suitability for use in my projects, both commercial and personal.

A while back, I evaluated Buildr as a potential replacement for Ant and Maven2 and eventually came to the unfortunate conclusion that it just wasn’t ready.  At the time, Buildr was still hampered by a number of annoying limitations (such as being unable to reconfigure the default directory structure).  I’m happy to say that many of these problems have been fixed in the 1.3 release, as well as a large number of feature additions which I hadn’t thought to request.  Despite all that, I’m still forced to conclude that Buildr simply isn’t (yet) suitable as my build tool of choice.

Now before you stone me for utter heresy, try to hear me out.  I’m not dissing Buildr in any way, I really want to use it, but I just can’t justify moving over to it in the face of all of its current issues.  What’s really unfortunate is that all of these issues can be summed up with a single word: transitivity.

One of Maven’s killer features is the ability to resolve the entire transitive dependency graph.  Now I’ll grant that it does this with varying degrees of success, but for the most part it’s pretty smart about how it fixes up your CLASSPATH.  As of 1.3, Buildr claims to have experimental support for transitive dependency resolution, but judging from the problems I encountered in my experimentation, it’s barely even deserving of mention as an experiment, much less a full-fledged feature.

To understand why transitive dependencies are such a problem, it is of course necessary to first understand the definition of such.  In a word (well, several anyway), a transitive dependency is an inherited dependency, an artifact which is not depended upon directly by the project, but rather by one of its dependencies.  This definitions carries recursively to dependent parents, grand-parents and so on, thus defining a transitive dependency graph.  Consider it this way:

image

In the diagram, MyCoolProject is the artifact we are trying to compile.  The only dependencies we have actually specified for this artifact are the DBPool and Databinder artifacts.  However, the Databinder artifact has declared that it depends upon both the Wicket and the ActiveObjects artifacts.  ActiveObjects doesn’t depend upon anything, but Wicket has dependencies SLF4J and on the Servlets API.  Thus, our original goal, MyCoolProject, has transitive dependencies Wicket, SLF4J, Servlets and ActiveObjects.  Quite a bit more than we thought we were asking for when we declared our dependency on Databinder.

In general, this sort of transitive resolution is a good thing.  It means that instead of specifying six dependencies, we only had to specify two.  Furthermore, we observed the DRY principle by not re-specifying dependency information already contained within the various packages.  As with any “good thing”, there’s definitely a valid argument regarding its down sides, but on the whole I’m quite fond of this feature.

In Maven, this sort of thing happens by default, which can lead to some very confusing and subtle CLASSPATH problems (conflicting packages, etc).  Buildr takes a slightly different approach in 1.3 by forcing the use of the transitive method:

repositories.remote << 'http://www.ibiblio.org/maven2'
 
define 'MyCoolProject' do
  project.version = '1.0'
 
  compile.with transitive('xmlrpc:xmlrpc-server:jar:3.0')
 
  package :jar
end

It’s easy to see why Buildr is raising such a ruckus in the build genre.  It’s syntax is elegance itself, and since it’s actually a proper scripting language (Ruby), there’s really nothing you can’t do with it.  Having to remember to specify the default repository is a bit of a pain, but it’s certainly something I can live with.  The real problem with this example is a little less subtle: It doesn’t work.

If you create a buildfile with the above contents and then run the buildr command, the results will be something like the following:

Downloading org.apache.xmlrpc:xmlrpc:jar:3.0
rake aborted!
Failed to download org.apache.xmlrpc:xmlrpc:jar:3.0, tried the following repositories:
http://www.ibiblio.org/maven2/

(See full trace by running task with --trace)

After a fairly significant amount of digging, I managed to discover that this problem is caused by the fact that Buildr attempts to download a corresponding JAR file for every POM it resolves.  This seems logical until you consider that many large projects (including Databinder, Wicket, and Hibernate) define POM-only projects which exist for the sole purpose of creating transitive dependencies.  They’re organizational units, designed to allow projects to depend upon one super-artifact, rather than a dozen sub-projects.  It’s a very common practice, and one which Buildr completely fails to handle appropriately.

After some prodding on the Buildr dev mailing-list, Assaf admitted that this is an issue worth looking at and provided a temporary workaround (pending a rework of the functionality in 1.4):

def pom_only(a)
  artifact a do |task|
    mkpath File.dirname(task.to_s)
    Zip::ZipOutputStream.open task.to_s
  end
end

This was about the point that I started wondering if maybe Ant/Ivy would be a better bet, at least for the time being.  To use this workaround, you must call the pom_only method once for every POM-only project in the dependency graph.  Usually, this means you must invoke Buildr repeatedly and find the troublesome artifacts by trial and error.  Not exactly a “just works” solution.

Pressing forward however, I unearthed a deeper, even more insidious issue: an intermittent failure to generate Eclipse project meta.  I’m not sure if this is due to the POM-only dependencies or just bad juju, but whatever the reason, it’s annoying.  I’ve raised the issue on the Buildr mailing lists, but so far no response.  Basically, what happens is something like this:

C:\Users\Daniel Spiewak\Desktop\MyCoolProject> buildr eclipse
(in C:/Users/Daniel Spiewak/Desktop/MyCoolProject, development)
Completed in 0.499s
C:\Users\Daniel Spiewak\Desktop\MyCoolProject>

Not exactly the most helpful output.  In case you were wondering, this process did not create the Eclipse metadata.  It’s interesting to note that calling buildr idea (to create project metadata for IntelliJ) seems to work just fine.  Whatever causes the bug, it seems to be specific to just the Eclipse project generator.

Buildr is a remarkable project.  It shows potential to someday become the de-facto build system, possibly even unseating Ant.  Unfortunately, that day is not today.  There are too many odd wrinkles and unpredictable errors to really call it a “finished product”.  Hopefully, the Buildr dev team will continue their excellent work, eventually producing a tool worthy of serious consideration.  Until then, I guess that I’m (still) stuck with Ant.

Update: It seems that the issues with transitive dependencies have been resolved in the latest Buildr versions in their SVN.  I’m looking forward to when this all becomes stable and publicly consumable!

The Plague of Polyglotism

28
Apr
2008

For those of you who don’t know, polyglotism is not some weird religion but actually a growing trend in the programming industry.  In essence, it is the concept that one should not be confined to a single language for a given system or even a specific application.  With polyglot programming, a single project could use dozens of different languages, each for a different task to which they are uniquely well-suited.

As a basic example, we could write a Wicket component which makes use of Ruby’s RedCloth library for working with Textile.  Because of Scala’s flexible syntax, we can use it to perform the interop between Wicket and Ruby using an internal DSL:

class TextileLabel(id:String, model:IModel) extends WebComponent(id, model) with JRuby {
  require("textile_utils")
 
  override def onComponentTagBody(stream:MarkupStream, openTag:ComponentTag) {
    replaceComponentTagBody(markupStream, openTag, 
        'textilize(model.getObject().toString()))
  }
}
# textile_utils.rb
require 'redcloth'
 
def textilize(text)
  doc = RedCloth.new text
  doc.to_html
end

Warning: Untested code

We’re actually using three languages here, even though we only have source for two of them.  The Wicket library itself is written in Java, our component is written in Scala and we work with the RedCloth library in Ruby.    This is hardly the best example of polyglotism, but it suffices for a simple illustration.  The general idea is that you would apply this concept to a more serious project and perform more significant tasks in each of the various languages.

The Bad News

This is all well and good, but there’s a glaring problem with this philosophy of design: not everyone knows every language.  You may be a language aficionado, picking up everything from Scheme to Objective-C, but it’s only a very small percentage of developers who share that passion.  Many projects are composed of developers without extensive knowledge of diverse languages.  In fact, even with a really good sampling of talent, it’s doubtful you’ll have more than one or two people fluent in more than two languages.  And unfortunately, there’s this pesky concern we all love called “maintainability”.

Let’s pretend that Slava Pestov comes into your project as a consultant and decides that he’s going to apply the polyglot programming philosophy.  He writes a good portion of your application in Java, Lisp and some language called Factor, pockets his consultant’s fee and then moves on.  Now the code he wrote may have been phenomenally well-designed and really perfect for the task, but you’re going to have a very hard time finding a developer who can maintain it.  Let’s say that six months down the road, you decide that your widget really needs a red push button, rather than a green radio selector.  Either you need a developer who knows Factor (hint: there aren’t very many), or you need a developer who’s willing to learn it.  The thing is that most developers with the knowledge and motivation to learn a language have either already done so, or are familiar enough with the base concepts as to be capable of jumping right in.  These developers fall into that limited group of people fluent in many different languages, and as such are a rare find.

Now I’m not picking on Factor in any way, it’s a very interesting language, but it still isn’t very widespread in terms of developer expertise.  That’s really what this all comes down to: developer expertise.  Every time you make a language choice, you limit the pool of developers who are even capable of groking your code.  If I decide to build an application in Java, even assuming that’s the only language I use, I have still eliminated maybe 20% of all developers from ever touching the project.  If I make the decision to use Ruby for some parts of the application, while still using Java for the others, I’ve now factored that 80% down to maybe 35% (developers who know Java and Ruby).  Once I throw in Scala, that cuts it down still further (maybe at 15% now).  If I add a fourth language - for example, Haskell - I’ve now narrowed the field so far, that it’s doubtful I’ll find anyone capable of handling all aspects within a reasonable price range.  It’s the same problem as with framework choice, except that frameworks are much easier to learn than languages.

The polyglot ideal was really devised by a bunch of nerdy folks like me.  I love languages and would like nothing better than to get paid to learn half a dozen new ones (assuming I’m coming into a project with a strange combination I haven’t seen before).  However, as I understand the industry, that’s not a common sentiment.  So a very loud minority of developers (/me waves) has managed to forge a very hot methodology, one which excludes almost all of the hard-working developer community.  If I didn’t know better, I would be tempted to say that it was a self-serving industry ploy to foster exclusivity in the job market.

I want to work on multi-language projects as much as anyone, but I really don’t think it’s the best thing right now.  I’m working on a project now which has an aspect for which Scala would be absolutely perfect, but since I’m the only developer on hand who is remotely familiar with the language, I’m probably going to end up recommending against its adoption.  Consider carefully the ramifications of trying new languages on your own projects, you may not be doing future developers any favors by going down that path.