
Formal Language Processing in Scala

16 Jun 2008

Quite some time ago, a smart cookie named Philip Wadler authored a paper explaining the concept of “parser combinators”, a method of representing the well-understood task of text parsing as the composition of atomic constructs which obey monadic laws.  This idea understandably captured the imaginations of a number of leading researchers, eventually developing into the Haskell Parsec library.  As a functional language with academic roots, it is only natural that Scala would include an implementation of the combinator concept in its standard library.

The inclusion of this framework into Scala’s core library has brought advanced text parsing to within the reach of the “common man” perhaps for the first time.  Of course, tools like Bison, ANTLR and JavaCC have been around for a long time and gained quite a following, but such tools often have a steep learning curve and are quite intimidating for the casual experimenter.  Parser combinators are often much easier to work with and can streamline the transition from “experimenter” to “compiler hacker”.

Of course, all of this functional goodness does come at a price: flexibility.  The Scala implementation of parser combinators rules out any left-recursive grammars.  Right-recursion is supported (thanks to call-by-name parameters), but any production rule which is left-recursive creates an infinitely recursive call chain and overflows the stack.  It is possible to overcome this limitation with right-associative operators, but even if such an implementation existed in Scala, it wouldn’t do any good.  Scala’s parser combinators effectively produce an LL(*) parser rather than the far more flexible LR (or the more commonly implemented LALR) parsers produced by tools like Bison or SableCC.  It is unknown to me (and everyone I’ve asked) whether it is even possible to produce an LR parser combinatorially, though the problem of left-recursion in LL(*) parsing has been studied at length.

The good news is that you really don’t need all of the flexibility of an LR parser for most cases.  (In fact, you don’t theoretically need LR at all; it just makes a lot of things easier.)  Parser combinators are capable of satisfying many of the common parsing scenarios which face the average developer.  Unless you’re planning on building the next Java-killing scripting language, the framework should be just fine.

Of course, any time you talk about language parsing, the topic of language analysis and interpretation is bound to come up.  This article explores the construction of a simple interpreter for a trivial language.  I chose to create an interpreter rather than a compiler mainly for simplicity (I didn’t want to deal with bytecode generation in a blog post).  I also steer clear of a lot of the more knotty issues associated with semantic analysis, such as type checking, object-orientation, stack frames and the like.

To be honest, I’ve been looking forward to writing this article ever since I read Debasish Ghosh’s excellent introduction to external DSL implementation in Scala.  Scala’s parser combinators are dreadfully under-documented, especially when you discount purely academic publications.  Hopefully, this article will help to rectify the situation in some small way.

Simpletalk: Iteration 1

Before we implement the interpreter, it is usually nice to have a language to interpret.  For the purposes of this article, we will be using an extremely contrived language based around the standard output (hence the name: “simple” “talk”).  The language isn’t complete (even in the later iterations) and you would be hard-pressed to find any useful application for it, but it makes for a convenient running example to follow.

In the first iteration, Simpletalk is based around two commands: print and space.  Two hard-coded messages are available for output via print: HELLO and GOODBYE.  We will increase the complexity of the language later, allowing for literals and alternative constructs, but for the moment, this will suffice.  An example Simpletalk program which exercises all language features could be as follows:

print HELLO
space
space
print GOODBYE
space
print HELLO
print GOODBYE

The output would be the following:

Hello, World!


Farewell, sweet petunia!

Hello, World!
Farewell, sweet petunia!

As I said, not very useful.

The first thing we need to do is to define a context-free grammar using the combinator library included with Scala.  The language itself is simple enough that we could write the parser by hand, but it is easier to extend a language with a declarative definition.  Besides, I wanted an article on using parser combinators to build an interpreter…

import scala.io.Source
import scala.util.parsing.combinator.syntactical.StandardTokenParsers

object Simpletalk extends StandardTokenParsers with Application {
  lexical.reserved += ("print", "space", "HELLO", "GOODBYE")
 
  val input = Source.fromFile("input.talk").getLines.reduceLeft[String](_ + '\n' + _)
  val tokens = new lexical.Scanner(input)
 
  val result = phrase(program)(tokens)
 
  // grammar starts here
  def program = stmt+
 
  def stmt = ( "print" ~ greeting
             | "space" )
 
  def greeting = ( "HELLO"
                 | "GOODBYE" )
}

All of this is fairly standard EBNF notation.  The one critical bit of syntax here is the tilde (~) operator, shown separating the "print" and greeting tokens.  This method is the sequencing (concatenation) operator for parsers.  Literally it means: first parse "print", then parse greeting, whatever that entails.  What would normally be non-terminals in a context-free grammar are represented here by full Scala methods.  Type inference keeps the syntax extremely concise.

It’s also worth noticing the use of the postfix plus operator (+) to specify a repetition of one or more occurrences.  This is standard EBNF and is implemented as perfectly valid Scala within the combinator library.  We could have also used the rep1(...) method, but I prefer the operator simply because it reads more like the conventional notation.
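
As a quick aside (and a hedged sketch rather than part of the article’s listing), both constructs are plain methods returning typed Parser values, so the equivalences can be written out explicitly inside the same object (the method names here are purely illustrative):

// For any p: Parser[A] and q: Parser[B], the ~ method yields a Parser[A ~ B],
// a parser for the pair of results; the postfix + is shorthand for rep1(...).
def programVerbose = rep1(stmt)          // equivalent to: def program = stmt+
def stmtPair       = "print" ~ greeting  // sequences the keyword with the greeting rule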

Looking back up toward the top of the Simpletalk singleton, we see the use of lexical.reserved.  We must define the keywords used by our language to keep the parser from treating them as identifiers.  To save time and effort, we’re going to implement the parser by extending the StandardTokenParsers class.  This is nice because we get a lot of functionality for free (such as parsing of string literals), but we also have to make sure that our language is relatively Scala-like in its syntax.  In this case, that is not a problem.

Down the line, we initialize the scanner and use it to parse a result from the hard-coded file, input.talk.  It would seem that even if we wanted to use Simpletalk for some useful application, we would have to ensure that the entire script was contained in a single, rigidly defined file relative to the interpreter.

Defining the AST

Once the grammar has been created, we must create the classes which will define the abstract syntax tree.  Almost any language defines a grammar which can be logically represented as a tree.  This tree structure is desirable as it is far easier to work with than the raw token stream.  The tree nodes may be manipulated at a high-level in the interpreter, allowing us to implement advanced features like name resolution (which we handle in iteration 3).

At the root of our AST is the statement, or rather, a List of statements.  Currently, the only statements we need to be concerned with are print and space, so we will only need to define two classes to represent them in the AST.  These classes will extend the abstract superclass, Statement.

The Space class will be fairly simple, as the command requires no arguments.  However, Print will need to contain some high-level representation of its greeting.  Since there are only two greetings available and as they are hard-coded into the language, we can safely represent them with separate classes extending a common superclass, Greeting.  The full hierarchy looks like this:

sealed abstract class Statement
 
case class Print(greeting: Greeting) extends Statement
case class Space extends Statement
 
sealed abstract class Greeting {
  val text: String
}
 
case class Hello extends Greeting {
  override val text = "Hello, World!"
}
 
case class Goodbye extends Greeting {
  override val text = "Farewell, sweet petunia!"
}

Believe it or not, this is all we need to represent the language in tree form as it stands.  However, the combinator library does not simply look through the classpath and guess which classes might represent AST nodes; we must explicitly tell it how to convert the result of each parse into a node.  This is done using the ^^ and ^^^ methods:

def program = stmt+
 
def stmt = ( "print" ~ greeting ^^ { case _ ~ g => Print(g) }
           | "space" ^^^ Space() )
 
def greeting = ( "HELLO" ^^^ Hello()
               | "GOODBYE" ^^^ Goodbye() )

The ^^^ method takes a literal value which will be returned whenever the parse succeeds.  Thus, any time the space command is parsed, the result will be an instance of the Space class defined above.  This is nicely compact and efficient, but it does not satisfy all cases.  The print command, for example, takes a greeting argument.  To allow for this, we use the ^^ method and pass it a pattern-matching function literal.  The pattern is matched against the parse result; if it matches, the inner expression is evaluated (in this case, Print(g)) and the result is returned.  Since the Parser defined by the greeting method already returns an instance of Greeting, we can safely pass the result of this parse as a parameter to the Print constructor.  Note that we need not define any node construction for the program rule, as the + operator already returns a List of whatever type its parser produces (in this case, Statement).
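
To make the relationship between the two methods concrete, here is a hedged sketch (not from the original listing; the method names are hypothetical): ^^^ behaves like a ^^ whose function ignores the matched token and substitutes a fixed value.

// These two productions are effectively equivalent:
def spaceStmt1 = "space" ^^^ Space()
def spaceStmt2 = "space" ^^ { _ => Space() }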

The Interpreter

So far, we have been focused exclusively on the front side of the interpreter: input parsing.  Our parser is now capable of consuming and checking the textual statements from input.talk and producing a corresponding AST.  We must now write the code which walks the AST and executes each statement in turn.  The result is a fairly straightforward recursive deconstruction of a list, with each node corresponding to an invocation of println.

class Interpreter(tree: List[Statement]) {
  def run() {
    walkTree(tree)
  }
 
  private def walkTree(tree: List[Statement]) {
    tree match {
      case Print(greeting) :: rest => {
        println(greeting.text)
        walkTree(rest)
      }
 
      case Space() :: rest => {
        println()
        walkTree(rest)
      }
 
      case Nil => ()
    }
  }
}

This is where all that work we did constructing the AST begins to pay off.  We don’t even have to manually resolve the greeting constants; that can be handled polymorphically within the nodes themselves.  Actually, in a real interpreter, it would probably be best to let the statement nodes handle the actual execution logic, thus enabling the interpreter to function merely as a dispatch core, remaining relatively agnostic of language semantics.
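
That node-driven alternative is not the design used in this article, but a rough, hypothetical sketch of it might look like the following (an alternative hierarchy, not additional classes):

// Hypothetical alternative: each node carries its own execution logic.
sealed abstract class Statement {
  def execute(): Unit
}

case class Print(greeting: Greeting) extends Statement {
  def execute() = println(greeting.text)
}

case class Space() extends Statement {
  def execute() = println()
}

// The interpreter then collapses into a simple dispatch loop:
class Interpreter(tree: List[Statement]) {
  def run() = tree foreach { _.execute() }
}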

The final piece we need to tie this all together is a bit of logic handling the parse result (assuming it is successful) and transferring it to the interpreter.  We can accomplish this using pattern matching in the Simpletalk application:

result match {
  case Success(tree, _) => new Interpreter(tree).run()
 
  case e: NoSuccess => {
    Console.err.println(e)
    exit(100)
  }
}

We could get a bit fancier with our error handling, but in this case it is easiest just to print the error and give up.  The error handling we have is sufficient for our experimental needs.  We can test this with the following input:

print errorHere
space

The result is the following TeX-like trace:

[1.7] failure: ``GOODBYE'' expected but identifier errorHere found

print errorHere
      ^

One of the advantages of LL parsing is that the parser can automatically generate relatively accurate error messages, just by inspecting the grammar and comparing it to the input.  This is much more difficult with an LR parser, which tracks multiple candidate productions simultaneously as it consumes the input.

Pat yourself on the back!  This is all that is required to wire up a very simple language interpreter.  The full source is linked at the bottom of the article.

Iteration 2

Now that we have a working language implementation, it’s time to expand upon it.  After all, we can’t leave things working for long, can we?  For iteration 2, we will add the ability to print arbitrary messages using string and numeric literals, as well as a simple loop construct.  This loop will demonstrate some of the real merits of representing the program as a tree rather than a simple token stream.  As for syntax, we will define an example of these features as follows:

print HELLO
print 42

space
repeat 10
  print GOODBYE
next

space
print "Adios!"

The result should be the following:

Hello, World!
42

Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!

Adios!

The first thing we will need to do to support these new features is update our grammar.  This is where we start to see the advantages of using a declarative API like Scala’s combinators as opposed to creating the parser by hand.  We will need to change the stmt rule (to accept the repeat command) as well as the greeting rule (to allow string and numeric literals):

def stmt: Parser[Statement] = ( "print" ~ greeting ^^ { case _ ~ g => Print(g) }
                              | "space" ^^^ Space()
                              | "repeat" ~ numericLit ~ (stmt+) ~ "next" ^^ {
                                  case _ ~ times ~ stmts ~ _ => Repeat(times.toInt, stmts)
                                } )
 
def greeting = ( "HELLO" ^^^ Hello()
               | "GOODBYE" ^^^ Goodbye()
               | stringLit ^^ { case s => Literal(s) }
               | numericLit ^^ { case s => Literal(s) } )

Notice that we can no longer rely upon type inference in the stmt method, as it is now recursive.  This recursion is in the repeat rule, which contains a one-or-more repetition of stmt.  This is logical since we want repeat to contain other statements, including other instances of repeat.  The repeat rule also makes use of the numericLit terminal.  This is a rule which is defined for us as part of the StandardTokenParsers.  Technically, it is more accepting than we want since it will also allow decimals.  However, we don’t need to worry about such trivialities.  After all, this is just an experiment, right?

The numericLit and stringLit terminals are used again in two of the productions for greeting.  Both of these parsers resolve to instances of String, which we pattern match and encapsulate within a new AST node: Literal.  It makes sense for numericLit to resolve to String because Scala has no way of knowing how our specific language will handle numbers.

These are the new AST classes required to satisfy the language changes:

case class Repeat(times: Int, stmts: List[Statement]) extends Statement
 
case class Literal(override val text: String) extends Greeting

Literal merely needs to encapsulate a String resolved directly out of the parse; no further processing is required.  Repeat, on the other hand, is a bit more interesting.  This class contains a list of Statement(s), as well as an iteration count, which together define the behavior of the repeat when executed.  This is our first example of a truly recursive AST structure.  Repeat is defined as a subclass of Statement, and it contains a List of such Statement(s).  Thus, it is conceivable that a Repeat could contain another instance of Repeat, nested within its structure (see the sketch below).  This is really the true power of the AST: the ability to represent a recursive grammar in a logical, high-level structure.
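
To make the recursion concrete, here is a hedged sketch (my own, not from the article) of the tree that a repeat nested inside another repeat would produce:

// A Repeat node nesting another Repeat node -- the AST mirrors the
// recursive structure of the source program.
val nested =
  Repeat(2, List(
    Print(Hello()),
    Repeat(3, List(
      Print(Goodbye())))))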

Of course, Interpreter also must be modified to support these new features; but because of our polymorphic Greeting design, we only need to worry about Repeat.  This node is easily handled by adding another pattern to our recursive match:

case Repeat(times, stmts) :: rest => {
  for (i <- 0 until times) {
    walkTree(stmts)
  }
 
  walkTree(rest)
}

Here we see the primary advantage to leaving the execution processing within Interpreter rather than polymorphically farming it out to the AST: direct access to the walkTree method.  Logically, each repeat statement contains a new Simpletalk program within itself.  Since we already have a method defined to interpret such programs, it only makes sense to use it!  The looping itself can be handled by a simple for-comprehension.  Following the loop, we deconstruct the list and move on to the next statement in the enclosing scope (which could be a loop itself).  This design is extremely flexible and capable of handling the fully recursive nature of our new language.

The only other change we need to make is to add our new keywords to the lexer, so that they are not parsed as identifiers:

lexical.reserved += ("print", "space", "repeat", "next", "HELLO", "GOODBYE")

Iteration 3

It’s time to move on to something moderately advanced.  So far, we’ve stuck to easy modifications like new structures and extra keywords.  A more complicated task would be the addition of variables and scoping.  For example, we might want a syntax something like this:

let y = HELLO

space
print y
let x = 42
print x

space
repeat 10
  let y = GOODBYE
  print y
next

space
print y

space
print "Adios!"

And the result:

Hello, World!
42

Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!

Hello, World!

Adios!

This is significantly more complicated than our previous iterations partially because it requires name resolution (and thus, some semantic analysis), but more importantly because it requires a generalization of expressions.  We already have expressions of a sort in the Greeting nodes.  These are not really general-purpose expressions as they do not resolve to a value which can be used in multiple contexts.  This was not previously a problem since we only had one context in which they could be resolved (print).  But now, we will need to resolve Greeting(s) for both print and let statements.

We will start with the AST this time.  We need to modify our Greeting superclass to allow for more complex resolutions than a static text value.  More than that, these resolutions no longer take place in isolation, but within a variable context (referencing environment).  This context will not be required for Literal, Hello or Goodbye expressions, but it will be essential to handle our new AST node: Variable.

sealed abstract class Statement
 
case class Print(expr: Expression) extends Statement
case class Space extends Statement
 
case class Repeat(times: Int, stmts: List[Statement]) extends Statement
case class Let(val id: String, val expr: Expression) extends Statement
 
sealed abstract class Expression {
  def value(context: Context): String
}
 
case class Literal(text: String) extends Expression {
  override def value(context: Context) = text
}
 
case class Variable(id: String) extends Expression {
  override def value(context: Context) = {
    context.resolve(id) match {
      case Some(binding) => binding.expr.value(context)
      case None => throw new RuntimeException("Unknown identifier: " + id)
    }
  }
}
 
case class Hello extends Expression {
  override def value(context: Context) = "Hello, World!"
}
 
case class Goodbye extends Expression {
  override def value(context: Context) = "Farewell, sweet petunia!"
}

Notice that the Print and Let nodes both accept instances of Expression, rather than the old Greeting node.  This generalization is quite powerful, allowing variables to be assigned the result of other variables, literals or greetings.  Likewise, the print statement may also be used with variables, literals or greetings alike.

Actually, the real meat of the implementation is contained within the Context class (shown below).  This data structure will manage the gritty details of resolving variable names into let-bindings (which can then be resolved as expressions).  Additionally, Context must deal with all of the problems associated with nested scopes and name shadowing (remember that our motivating example shadows the definition of the y variable).

For the moment, we need not concern ourselves with reassignment (mutability).  Redefinition will be allowed, but simplicity in both the grammar and the interpreter calls for fully immutable constants rather than true variables.  Additionally, this restriction allows our interpreter to easily make use of fully immutable data structures in its implementation.  Scala doesn’t really impose this as an implementation requirement, but it’s neat to be able to do.  Also, the pure-functional nature of the data structures provides greater assurance as to the correctness of the algorithm.

Here is the full implementation of both Context and the modified Interpreter:

class Interpreter(tree: List[Statement]) {
  def run() {
    walkTree(tree, EmptyContext)
  }
 
  private def walkTree(tree: List[Statement], context: Context) {
    tree match {
      case Print(expr) :: rest => {
        println(expr.value(context))
        walkTree(rest, context)
      }
 
      case Space() :: rest => {
        println()
        walkTree(rest, context)
      }
 
      case Repeat(times, stmts) :: rest => {
        for (i <- 0 until times) {
          walkTree(stmts, context.child)
        }
 
        walkTree(rest, context)
      }
 
      case (binding: Let) :: rest => walkTree(rest, context + binding)
 
      case Nil => ()
    }
  }
}
 
class Context(ids: Map[String, Let], parent: Option[Context]) {
  lazy val child = new Context(Map[String, Let](), Some(this))
 
  def +(binding: Let) = {
    val newIDs = ids + (binding.id -> binding)
    new Context(newIDs, parent)
  }
 
  def resolve(id: String): Option[Let] = {
    if (ids contains id) {
      Some(ids(id))
    } else {
      parent match {
        case Some(c) => c resolve id
        case None => None
      }
    }
  }
}
 
object EmptyContext extends Context(Map[String, Let](), None)

Context is really little more than a thin wrapper around an immutable Map from String to Let.  The resolve method attempts to resolve a given identifier within the local context.  If that fails, it recursively attempts resolution on the parent context, to which it has a reference.  This proceeds all the way up to the root context of the program, which has no parent.  If a resolution still fails at this point, None is returned and an error is thrown from within Interpreter.

The only other method of significance within Context is the + operator, which is the means by which new bindings are added to the context.  More specifically, a new context is constructed from the old in which the new binding is present (remember, immutable data).  The child value represents a new Context which is nested within the current (such as within a repeat block).  This can be a constant because of the immutability of the structure.  Thus, a single context may contain many nested contexts; but if none of them contain new bindings, then they will all be the exact same instance.
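
A small, hedged usage sketch (the bindings here are purely for illustration) may help make the shadowing semantics concrete:

// Bindings accumulate functionally; shadowing falls out naturally because the
// child's own map is consulted before delegating to the parent.
val outer = EmptyContext + Let("y", Hello())
val inner = outer.child + Let("y", Goodbye())

outer.resolve("y")    // Some(Let("y", Hello()))   -- the outer binding
inner.resolve("y")    // Some(Let("y", Goodbye())) -- shadowed in the child scope
inner.resolve("x")    // None                      -- not bound anywhere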

Jumping up to Interpreter, we see how Context is actually used.  Because our language lacks forward-referencing, we don’t have to worry about multiple passes through the AST.  This simplifies things tremendously, enabling us to handle semantic analysis and interpretation in a single effective phase.  As we walk through the AST, we carry the current Context along with us.  Whenever we come across a new binding (a let statement), we create a new Context with the binding and recursively pass it to the next node interpretation.  Whenever we come to the definition of a new context (in this case, a repeat statement), the child value of the current context is retrieved and passed to the interpretation of all nested statements.  This Context is discarded once we resume interpretation of the current level.

The final implementation step is to handle the grammar for our new language features.  This is easily done by modifying our existing combinator definitions.  Additionally, we must add a new keyword (“let”) to the list of reserved identifiers, as well as the equals sign (“=”) to the list of token delimiters.

lexical.reserved += ("print", "space", "repeat", "next", "let", "HELLO", "GOODBYE")
lexical.delimiters += ("=")
 
// ...
 
def program = stmt+
 
def stmt: Parser[Statement] = ( "print" ~ expr ^^ { case _ ~ e => Print(e) }
                              | "space" ^^^ Space()
                              | "repeat" ~ numericLit ~ (stmt+) ~ "next" ^^ {
                                  case _ ~ times ~ stmts ~ _ => Repeat(times.toInt, stmts)
                                } 
                              | "let" ~ ident ~ "=" ~ expr ^^ { 
                                  case _ ~ id ~ _ ~ e => Let(id, e) 
                                } )
 
def expr = ( "HELLO" ^^^ Hello()
           | "GOODBYE" ^^^ Goodbye()
           | stringLit ^^ { case s => Literal(s) }
           | numericLit ^^ { case s => Literal(s) }
           | ident ^^ { case id => Variable(id) } )

Once again, the StandardTokenParsers gives us a great deal of functionality for free.  In this case, we make use of the ident terminal, which represents a Scala-style identifier in our language.  This token occurs in two places: the let statement and as a raw expression.  Just like stringLit and numericLit, ident is resolved down to a String instance, which we store in Let and Variable nodes for future use.

Conclusion

This concludes my introduction to the subtle and fascinating world of language processing.  In truth, I’m rather upset that I had to leave so many things out of the article and that a lot of my explanations were rushed at best.  There’s just so much information to fit into so little space!  I strongly suggest that you experiment further with the techniques associated with interpreters and (more interestingly) compilers.  These studies have helped me immeasurably as a programmer, imparting a much deeper understanding of what goes on “under the surface” and why things are the way they are in languages.

The Brilliance of BDD

9 Jun 2008

As I have previously written, I have recently been spending some time experimenting with various aspects of Scala, including some of the frameworks which have become available.  One of the frameworks I have had the privilege of using is the somewhat unassumingly-titled Specs, an implementation of the behavior-driven development methodology in Scala.

Specs takes full advantage of Scala’s flexible syntax, offering a very natural format for structuring tests.  For example, we could write a simple specification for a hypothetical add method in the following way:

object AddSpec extends Specification {
  "add method" should {
    "handle simple positives" in {
      add(1, 2) mustEqual 3
    }
 
    "handle simple negatives" in {
      add(-1, -2) mustEqual -3
    }
 
    "handle mixed signs" in {
      add(1, -2) mustEqual -1
      add(-1, 2) mustEqual 1
    }
  }
}

We could go on, of course, but you get the picture.  This code will lead to the execution of four separate assertions in three tests (to put things into JUnit terminology).  Fundamentally, this isn’t too much different than a standard series of unit tests, just with a slightly nicer syntax.

Specs defines a domain-specific language for structuring test assertions in a simple and intuitive way.  However, this is hardly the only framework for BDD.  Perhaps the most well-known such framework is RSpec, which answers a similar use-case in the Ruby programming language.  Our previous specification could be rewritten using RSpec as follows:

describe AddLib do
  it 'should handle simple positives' do
    add(1, 2).should == 3
  end
 
  it 'should handle simple negatives' do
    add(-1, -2).should == -3
  end
 
  it 'should handle mixed signs' do
    add(1, -2).should == -1
    add(-1, 2).should == 1
  end
end

The end-result is basically the same: the add method will be tested against the given assertions (all four of them) and the results printed in some sort of report form.  In this area, RSpec is significantly more mature than Specs, generating very slick HTML reports and nicely formatted console output.  This isn’t really a fundamental weakness of the Specs framework however, just indicative of the fact that RSpec has been around for a lot longer.

These two frameworks are interesting of course, but they are merely implementations of a much larger concept: behavior-driven development.  I’ve never been much of a fan of unit testing.  It’s always seemed to be incredibly dull and a very nearly fruitless waste of effort.  As much as I hate it though, I have to bow to the benefits of a self-contained test suite; and so I press on, cursing JUnit every step of the way.

BDD provides a nice alternative to unit testing.  At its core, it is not much different in that test groupings and primitive assertions are used to check all aspects of a test unit against predefined data.  However, there is something about the “flow” of a behavioral spec that is considerably easier to deal with.  For some reason, it is far less painful to devise a comprehensive test suite using BDD principles than conventional unit testing.  It seems a little far-fetched, but BDD actually makes it easier to write (and more importantly, formulate) exactly the same tests.

It’s an odd phenomenon, one which can only be caused by the storyboard flow of the code itself.  It is very natural to think of distinct requirements for a test unit when each of these requirements is labeled and entered in a logical sequence.  Moreover, the syntax of both Specs and RSpec is such that there is very little boilerplate required to set up an additional test.  Compare the previous BDD specs with the following JUnit 4 example:

public class MathTest {
    @Test
    public void testSimplePositives() {
        assertEquals(3, add(1, 2));
    }
 
    @Test
    public void testSimpleNegatives() {
        assertEquals(-3, add(-1, -2));
    }
 
    @Test
    public void testMixedSigns() {
        assertEquals(-1, add(1, -2));
        assertEquals(1, add(-1, 2));
    }
}

JUnit just requires that much more syntax.  It breaks up the logical flow of the tests and (more importantly) the developer’s train of thought.  What’s worse, this syntax bloat makes it very tempting to just group all of the assertions into a single test – to save typing if nothing else.  This is problematic because a single failing assertion may shadow all the others, preventing them from ever being executed.  This can make certain problems much more difficult to isolate.

Logical flow is extremely important to test structure.  BDD frameworks provide a very nice syntax for painlessly defining comprehensive test suites.  The really wonderful thing about all of this is that BDD is available on the JVM, right now.  There’s nothing stopping you from writing your code in Java as you normally would, then creating your test suite in Scala using Specs rather than JUnit.  Alternatively, you could use RSpec on top of JRuby, or Gspec with Groovy.  All of these are seamless replacements for a test framework like JUnit, and they require far less syntactic overhead.

The growing move toward polyglot programming encourages the use of a separate language when it is best suited to a particular task.  In this case, several languages are available which offer far more powerful test frameworks than those which can be found in Java.  Why not take advantage of them?

The Problem of Perspective Multiplicity

2 Jun 2008

Some six years ago, I switched my primary IDE from NetBeans to Eclipse JDT (then 2.0).  At the time, I did this primarily because NetBeans was too much of a resource hog for my pathetic development machine, but I quickly learned to appreciate the power of the Eclipse development environment.  NetBeans has since made great strides of course, but at the time, Eclipse was lightyears beyond it in both features and polish.

One of the more interesting features offered by Eclipse was the concept of a “perspective”, a collection of views in a specific layout conducive to performing a specific series of tasks.  The major upshot of this was that, instead of the debugger views popping in and out, they simply remained hidden in a separate perspective, ready to be restored to your customized configuration as necessary.  This innovation was also present in other areas, such as the CVS Team view and the Update Manager (yes, the Eclipse update system was once a set of views and editors).

You could switch between these perspectives manually of course, but most of the time Eclipse was able to just detect which perspective you needed and make the switch automatically.  If you were to launch an application in debug mode for example, the “Debug” perspective would be opened automatically, bringing useful views to the fore.  Once you were done debugging, it was easy to switch back to the “Java” perspective for more streamlined editing.  It was a good system, and it worked well.

Unfortunately, times have changed.  Don’t get me wrong, I still love having all my debug views and layout saved for me in a discrete section of the app, ready to access on a moment’s notice.  But Eclipse is no longer the single-purpose application it once was.  Yes, I know that it has always been billed as “an open tool platform for everything and nothing in particular”, but back in the day (and especially before OSGi) most people had yet to realize this.  The only language supported by Eclipse on any serious level was Java, thus the perspective system worked extremely well for organizing IDE views.  Now, Eclipse serves as the foundation for IDE frameworks supporting dozens of different languages, requiring an equal (if not greater) number of perspectives.

[Screenshot: the Eclipse perspective switcher, listing a long row of open perspectives]

Even in this screenshot, I’m still hiding easily 70% of the perspectives available to Eclipse.  With all of these different view collections and configurations, it’s no wonder that people often find Eclipse to be confusing compared to other IDEs.  In NetBeans (for example) you can work with as many languages as you want within a single perspective/layout/configuration.  The outline shows the relevant information for whatever file you have open, and the project explorer view is fully integrated with each language, showing all available projects and their associated structure.  Most importantly, this view is able to show project logical structure as dictated by the support module for that language (e.g. src/, test/, etc).

Effectively, other IDEs have evolved a single “Development” perspective, one which shows a generic set of views common to all languages.  Unlike Eclipse, which requires switching to the Ruby perspective or the C/C++ perspective to get the appropriate project viewer, NetBeans has one project viewer which is extensible by any module.  Eclipse has some of this with the Package Explorer, but some plugins like DLTK don’t properly integrate and so the view isn’t as streamlined.  Additionally, some functions like “Open Type” don’t work appropriately unless in the corresponding perspective for a given language.

Yes, I am aware that I could simply open any views I want within a single perspective, but that’s not what I’m looking for.  I don’t want to open five different views for navigating project files, I want to have one master view which shows me everything through the filter of whatever language is relevant to the project.  Project Explorer comes close, but it fails to handle the tighter integration (such as the “Referenced Libraries” in JDT or script outlines for DLTK).

Theoretically, Eclipse only needs four or five perspectives for the average developer working with any number of languages: Develop, Debug, Test, Repositories, Synchronize.  Obviously, more perspectives would be needed for functionality which does not conform to normal development conventions (such as “Planning” or even “Email”), but I think that these core perspectives could provide a consistent, generic framework to which any language IDE could conform.  We can already see something similar happening with the Debug perspective, which is used by Java, Ruby, Scala and C/C++ alike.

What is needed is a common super-framework to be extended by actual language implementations such as JDT, CDT and the like (similar to what DLTK provides but more encompassing).  This framework should provide a common platform with features such as project viewing, outline, documentation, type hierarchy, call hierarchy, open type, etc.  This platform would then be specialized by the relevant IDE and the same views would allow extension to fit the needs of the language in question.  This already happens with the Outline view, but it needs to occur with other common functions as enumerated.  Views which are not common to different languages (such as Ant Build or Make Targets) would of course not be contained within this super-framework, but would be separate views as they are now.  This framework would allow a developer to use a single set of views for any language, never requiring a workflow-disrupting change of perspective.

The building blocks are all in place, and such an effort would still be in line with the Eclipse philosophy of total extensibility; it’s merely a question of implementation and opinion.  The implementation is the simple part: as I said, most of the functionality is already available (often redundantly) in any one of the many IDE packages.  The bigger challenge is to convince those who have the power to make the decision.  Eclipse 4.0 is coming; it should be an interesting road to follow.

Naïve Text Parsing in Scala

26 May 2008

One of the truly incredible things about Scala is that it really inspires people to consider problems that they never would have attempted before.  Recently, the urge came upon me to try my hand at some more advanced text processing.  Not quite so advanced as a full language, but more complicated than can be easily handled by regular expressions.  As usual, Scala proved more than up to the challenge.

All of us have evolved more-or-less ad hoc methods for handling simple text processing.  In this modern age, it’s almost a requirement to be familiar with basic regular expressions, or at the very least split, substring and find.  These techniques tend to work well in small-scale applications, but things become a bit muddled when trying to deal with more complex or conditional representations.  It gets even worse when you must perform some sort of complex resolution on the results of your parsing, requiring you to devise an intermediate form.

One such example of a more complicated text parse would be the C-style printf function.  Java has had this functionality since 1.5 in the form of PrintStream#printf(...) as well as String.format(String, Object...); but unfortunately, Scala lacks this highly useful method.  Oh, it has a printf function, but it doesn’t support C-style syntax for reasons of backwards compatibility.  This caused me no end of grief when I was trying to construct a quine in Scala.  Since Scala has no printf, I decided to try my hand at implementing one (just for kicks).

Finite State Machines

As I said, the ad hoc parsing techniques may serve us well when we’re just trying to split a full name into a firstname/lastname tuple, but I’m afraid that printf requires a more disciplined approach.  Fortunately, there are a number of beautiful formalisms for dealing with text parsing.  Chief among these are deterministic finite state machines.

If you took formal language theory in college, you’ve probably already worked with DFAs (Deterministic Finite Automata), NFAs (Non-deterministic Finite Automata) and PDAs (Pushdown Automata); but since everyone I know slept through that class, I’ll just assume that you did too and go over some of the basics again.  Finite state machines (automata) are at the core of Turing’s seminal thesis on computability.  Actually, the so-called “Turing Machine” is at the core of his work, but DFAs are really just a limited form of this concept.

The ideas behind the acronyms are very simple: a finite state machine is a collection of “states” connected by transitions which dictate the ensuing state or the termination of execution.  The most common representation of a DFA is a directed graph, in which the states are represented by the vertices.  A double circle indicates an “accepting state”.

[Diagram: an example DFA with eight states]

This is a simple DFA which has four accepting states (4, 6, 7 and 8).  There is also a loop transition on state 3.  Each of these states represents a different position (or “state”, hence the term) in the parse process.  The idea is that you consume just one character at a time and based on the character value, the automaton “chooses” the correct transition.  It’s all very mindless, very sequential (hence the name “automaton”).

The only problem here is that there may not be a transition for every possible character.  For example, starting from state 1, we know how to handle characters a, b, c, d and g, but what happens if we actually get an s or a 7?  By some definitions, this failing would indicate that we have an invalid DFA, something which is obviously bad.  Most representations, however, allow unsatisfied input and merely have an implicit transition to a non-accepting error state.  Most common applications make use of this rule (we’ll get to that in a minute).

If you execute the given automaton, you will find that it accepts the following inputs:

  1. ban
  2. aa120m
  3. da1o
  4. gs
  5. ga

…but rejects (or errors) on these alternative sequences:

  1. aaa7
  2. da6mn
  3. gq88

These claims are fairly easy to verify by mentally consuming each character in turn and transitioning to the corresponding state (if any).  Thus, for the first series of inputs, the state sequences will be as follows:

  1. 1, 2, 3, 4
  2. 1, 2, 3, 3, 3, 3, 4
  3. 1, 2, 3, 3, 7
  4. 1, 5, 6
  5. 1, 5, 8

Notice how every parse starts with the initial state (1).  This may seem sort of academic (since the parse information is all encoded in the transitions), but it turns out that without this formalism, many common every-day tasks which we take for granted would be impossible.

If you look closely at my example, you’ll notice that you can very easily encode the same accept/reject information using a regular expression:

[a-d]a[0-9]*([nm]|[a-lo-z])|g([n-z][0-9]?|[abc])

Ok, so maybe that’s not the easiest connection to make, but I think you get the picture.  As it turns out, regular expressions and finite automata are equivalent in expressive power: every regular expression can be compiled into a DFA, and vice versa.  In fact, classic regular expression engines do exactly this – they compile the expression into an automaton (using various techniques) and then execute that automaton against the input.  It does require an intervening step to convert from a regular expression to a DFA, but it’s not that difficult to do.
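
As a quick sanity check (my own sketch, not part of the original article), the claimed accept/reject behavior can be verified by anchoring the regular expression over each input; String.matches performs exactly that kind of full-string match:

val pattern = "[a-d]a[0-9]*([nm]|[a-lo-z])|g([n-z][0-9]?|[abc])"

val accepted = List("ban", "aa120m", "da1o", "gs", "ga")
val rejected = List("aaa7", "da6mn", "gq88")

// Every accepted input matches the whole pattern; every rejected one does not.
accepted foreach { s => assert(s matches pattern) }
rejected foreach { s => assert(!(s matches pattern)) }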

The printf Case Study

Now that I’ve managed to lull all of you to sleep, it’s time to get back to more practical matters.  All of this formal theory actually has some very down-to-earth applications, including the algorithms required to implement printf.

C-style printf has a fairly flexible syntax which allows not only simple substitutions, but also type-dependent formatting, padding and truncation.  For example, we can do something like this in Java:

double pi = 3.14159;
System.out.printf("My favorite number: %n%80.2f", pi);

The result looks like this (including all the space):

My favorite number:
                                                                            3.14

There are a number of different conversions available, denoted by the letter trailing the escape – in this case, n and f respectively.  There are also a large number of flags which can be used, the capability to specify the argument index, etc.  Altogether, the context-free grammar for this format looks like this (source):

format ::= '%' index flags width precision conversion

index ::= INTEGER '$' | '<' | ε

flags ::= flags flag | ε
flag ::= '-' | '#' | '+' | ' ' | '0' | ',' | '('

width ::= INTEGER | ε
precision ::= '.' INTEGER | ε

conversion ::= 'b' | 'B'
             | 'h' | 'H'
             | 's' | 'S'
             | 'c' | 'C'
             | 'd'
             | 'o'
             | 'x' | 'X'
             | 'e' | 'E'
             | 'f'
             | 'g' | 'G'
             | 'a' | 'A'
             | ( 't' | 'T' ) date_format
             | '%'
             | 'n'

date_format ::= ...

I have omitted the grammar for date formatting just for the sake of simplicity.  The epsilon (ε) symbolizes the empty string ("").  In case you found the above confusing, this is a (slightly) more human-readable variant:

%[argument_index$][flags][width][.precision]conversion

Essentially, this boils down to the following: A substitution format is escaped by a percent sign (%) followed by an optional index, flags, width and precision, as well as a mandatory conversion indicator.  The date/time conversion is special and takes a series of formatting parameters immediately trailing the conversion.  For the sake of sanity, our parser implementation will ignore this inconvenient fact.

We could take this CFG (context-free grammar) and feed it into a parser generator (such as the one built into Scala) and generate an AST that way.  However, in this particular instance there is no need.  A cursory glance at the grammar indicates that there is no case wherein the syntax is meaningfully self-recursive.  A minor exception to this is the flags rule, but that is really just a way of expressing a repetition in BNF-ish style.  A moment’s reflection will lead us to the (correct) conclusion that because the grammar is effectively non-recursive, it can also be represented as a regular expression – thus, a DFA.  In fact, you can prove this point, but that bit of academic trivia is unimportant for the moment.

What is important is to realize that because this grammar is expressible in the form of a DFA, we can actually write code which parses it without too much trouble.  Parsers can be (and have been) written by hand, but usually when the grammar gets complex, the parser reflects this exponentially.  While it is not difficult to write a simple PDA by hand, doing so would be overkill here.  So rather than starting with the BNF grammar and creating a literal representation, we will work off of the one-line informal syntax and produce a stackless automaton (the defining feature of a pushdown automaton is the use of a stack to maintain recursive state).

Implementation

As it turns out, all of this gobbledygook expresses very elegantly in functional languages.  In truth, I could have written the parser in ML, but it is much more fun to use Scala.  We start out by considering how we want the intermediate form to be expressed.  Since we’re not writing a true compiler, we don’t need to worry about serializing this IF into anything persistent; we can rely entirely on in-memory state.  For a more complicated grammar, we might write classes to represent a tree structure (commonly referred to as an AST).  However, because printf escapes are so straightforward, we can simply generate a token stream.  We will represent this as a List[Token] using the following definitions.

sealed abstract class Token
 
case class CharToken(token: Char) extends Token
case class FormatToken(index: Index, flags: Set[Flag.Value],
                       width: Option[Int], precision: Option[Int],
                       format: Char) extends Token
 
sealed abstract class Index
 
case class Value(index: Int) extends Index
case class Previous extends Index
case class Default extends Index
 
object Flag extends Enumeration {
  val LEFT_JUSTIFIED,
      ALTERNATE,
      SIGNED, 
      LEADING_SPACE,
      ZERO_PADDED, 
      GROUP_SEPARATED,
      NEGATIVE_PAREN = Value
}
 
val flagMap = {
    import Flag._
 
    Map('-' -> LEFT_JUSTIFIED, '#' -> ALTERNATE, '+' -> SIGNED, 
        ' ' -> LEADING_SPACE, '0' -> ZERO_PADDED, ',' -> GROUP_SEPARATED,
        '(' -> NEGATIVE_PAREN)
  }

Note that Option is insufficient to represent index because of the < escape (use previous index).  Thus, we define a separate series of types with three alternatives: Value, Previous and Default.  This is similar to Option, but more specific to our needs.  Finally, the flags are represented as an enumeration.  Scala doesn’t have language-level support for enumerations, so the syntax ends up being a fair bit more verbose than the equivalent Java.  It is for this reason that enumerations aren’t used very much in Scala; sealed case classes with an object module (to serve as the namespace) are generally preferred instead.

Our parser will have to consume the entire format string, including non-escapes.  The final representation will be an immutable list of Token(s), either CharToken for a single run-of-the-mill character, or FormatToken which will represent the fully-parsed substitution.  Thus, for the printf example given above (my favorite number), the token stream will look something like this:

CharToken('M') :: CharToken('y') :: CharToken(' ') :: /* ... */ :: 
    FormatToken(Default(), Set(), None, None, 'n') :: 
    FormatToken(Default(), Set(), Some(80), Some(2), 'f') :: Nil

For those of you unfamiliar with the cons operator (::), it is just about the most useful functional idiom known to exist, especially in conjunction with pattern matching.  All it does is construct a new linked list with the value on the left as the head and the list to the right as the tail.  Nil is the empty list and thus commonly serves as the tail of a compound cons expression.
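
For readers new to the idiom, here is a tiny, illustrative sketch (unrelated to the printf code itself) of cons construction and the matching deconstruction:

val list = 1 :: 2 :: 3 :: Nil          // List(1, 2, 3)

list match {
  case head :: tail => println(head)   // prints 1; tail is List(2, 3)
  case Nil => println("empty")
}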

To produce this token stream, we will need to write an automaton which consumes each character in the stream and inspects it to see if it marks the beginning of a substitution.  If not, then a CharToken should be generated and put in the list.  However, if the character does mark an escape, then the automaton should transition to a different branch, consuming characters as necessary and walking through the algorithmic representation of our one-line syntax.  It is possible to diagram the necessary automaton, but to do so would be both pointless and unhelpful.  It’s probably easier just to dive into the code:

type Input = ()=>Option[Char]
 
def parse(stream: Input): List[Token] = {
  stream() match {
    case Some('%') => parseIndex1(stream) :: parse(stream)
    case Some(x) => CharToken(x) :: parse(stream)
    case None => Nil
  }
}

Rather than trying to efficiently walk through a proper String instance, it is easier to deal with a single-character stream.  The Input type alias defines a function value which will return Some(x) for the next character x in the string, or None if the end of the string has been reached.  It’s like a type-safe EOF.  We will call this method in the following way:

var index = -1
val tokens = parse(() => {
  index = (index + 1)
  if (index < pattern.length) Some(pattern(index)) else None
})

Our cute use of mutable state (index) makes this code far more concise than it would have been had we attempted to do things purely functionally.  As it turns out, this is the only place in our parser where we need to maintain state which is not on the stack.  Because no lookahead is required, we can simply march blindly through the syntax, consuming every character we come across and transitioning to a corresponding state.
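
An equivalent way to build the same Input value is to wrap an Iterator over the pattern; this is a hedged alternative sketch rather than the code used in the article:

// The iterator hides its mutation behind the same () => Option[Char] interface.
// (String#elements returns an Iterator[Char] in Scala 2.7; it was renamed to
// iterator in later versions.)
val chars = pattern.elements
val tokens2 = parse(() => if (chars.hasNext) Some(chars.next) else None)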

The first state of our automaton is represented by the parse(Input) method.  It has a transition to a normal character-consuming state, which in turn transitions back to parse (CharToken(x) cons’d with the recursive evaluation).  Our first state also has a transition to a more complicated state represented by parseIndex1(Input).  This transition takes place whenever we consume a percent (%) character.  What happens next is much easier to explain with code than in words (warning: 90-line snippet):

def parseIndex1(stream: Input) = stream() match {
  case Some(c) => {
    if (flagMap contains c) {
      parseFlags(stream, Default(), Set(flagMap(c)))
    } else if (c == '<') {
      parseFlags(stream, Previous(), Set[Flag.Value]())
    } else if (c == '.') {
      parsePrecision(stream, Default(), Set[Flag.Value](), None, 0)
    } else if (Character.isDigit(c)) {
      parseIndex2(stream, Character.digit(c, 10))
    } else {
      parseFormat(stream, Default(), Set[Flag.Value](), None, None, c)
    }
  }
 
  case None => throw new InvalidFormatException("Unexpected end of parse stream")
}
 
def parseIndex2(stream: Input, value: Int): FormatToken = stream() match {
  case Some(c) => {
    lazy val index = if (value == 0) Default() else Value(value)
    lazy val width = if (value == 0) None else Some(value)
 
    if (c == '.') {
      parsePrecision(stream, Default(), Set[Flag.Value](), width, 0)
    } else if (c == '$') {
      parseFlags(stream, index, Set[Flag.Value]())
    } else if (Character.isDigit(c)) {
      parseIndex2(stream, (value * 10) + Character.digit(c, 10))
    } else {
      parseFormat(stream, Default(), Set[Flag.Value](), width, None, c)
    }
  }
 
  case None => throw new InvalidFormatException("Unexpected end of parse stream")
}
 
def parseFlags(stream: Input, index: Index, flags: Set[Flag.Value]): FormatToken =
  stream() match {
    case Some(c) => {
      if (flagMap contains c) {
        parseFlags(stream, index, flags + flagMap(c))
      } else if (c == '.') {
        parsePrecision(stream, index, flags, None, 0)
      } else if (Character.isDigit(c)) {
        parseWidth(stream, index, flags, Character.digit(c, 10))
      } else {
        parseFormat(stream, index, flags, None, None, c)
      }
    }
 
    case None => throw new InvalidFormatException("Unexpected end of parse stream")
  }
 
def parseWidth(stream: Input, index: Index, flags: Set[Flag.Value], 
               value: Int): FormatToken = stream() match {
  case Some(c) => {
    lazy val width = if (value == 0) None else Some(value)
 
    if (c == '.') {
      parsePrecision(stream, index, flags, width, 0)
    } else if (Character.isDigit(c)) {
      parseWidth(stream, index, flags, (value * 10) + Character.digit(c, 10))
    } else {
      parseFormat(stream, index, flags, width, None, c)
    }
  }
 
  case None => throw new InvalidFormatException("Unexpected end of parse stream")
}
 
def parsePrecision(stream: Input, index: Index, flags: Set[Flag.Value], 
                   width: Option[Int], value: Int): FormatToken = stream() match {
  case Some(c) => {
    lazy val precision = if (value == 0) None else Some(value)
 
    if (Character.isDigit(c)) {
      parsePrecision(stream, index, flags, width, (value * 10) + Character.digit(c, 10))
    } else {
      parseFormat(stream, index, flags, width, precision, c)
    }
  }
 
  case None => throw new InvalidFormatException("Unexpected end of parse stream")
}
 
def parseFormat(stream: Input, index: Index, flags: Set[Flag.Value],
                width: Option[Int], precision: Option[Int], c: Char) = {
  FormatToken(index, flags, width, precision, c)
}

If you can get past the sheer volume of code here, it actually turns out to be pretty simple.  Each method represents a single state.  Some of these states have loop transitions (so as to consume multi-digit precisions, for example), but for the most part, flow travels smoothly from each state to the next.  The transitions are defined by the if/else if/else expressions within each method.  Note that because every if statement has a corresponding else, we are allowed to treat them as expressions with a proper value and thus avoid the use of any explicit returns (improving the conciseness of the code).

The final state is represented by the parseFormat(...) method.  This method constructs a FormatToken based on the accumulated values and then returns, unwinding our long and recursive automaton branch all the way back to the parse method, which places our token in the list and moves on.  Simple and to the point.

Tail Recursion

As a side-bonus, it is possible to rewrite the parse method so that it is tail recursive, allowing the Scala compiler to overwrite each stack frame with its successor.  Some of the substitution state methods are already tail recursive, but these loop far less frequently than parse.  In fact, if we don’t write a tail recursive parser, we will be unable to handle large strings due to stack overflow.

The tail recursive form of parse is nowhere near as elegant, but it gets the job done.  Like most tail recursive methods, it makes use of an accumulator which is passed from each call to the next.  So rather than parsing the tokens recursively and then constructing the list as we pop back up the stack, we construct the list as we go and return the completed value at the end.  The only problem with this is that cons prepends elements to the list.  This means that when we finally return the accumulated list, it will be the exact inverse of what we want.  Thus, we must explicitly reverse the list at the termination of the character stream.  This actually means that the tail recursive form will require more bytecode instructions than the original, but it will execute more efficiently due to the elimination of stack growth (effectively, scalac will collapse the method into a while loop at compile-time).

def parse(stream: Input) = tailParse(stream, Nil)
 
def tailParse(stream: Input, back: List[Token]): List[Token] = {
  stream() match {
    case Some('%') => tailParse(stream, parseIndex1(stream) :: back)
    case Some(x) => tailParse(stream, CharToken(x) :: back)
    case None => back.reverse
  }
}

Conclusion

Hopefully, this has been a thoroughly enjoyable visit to the land of parsing and formal language theory (I had fun anyway).  As usual, Scala proves itself to be an extremely expressive language, capable of representing both the theoretical and the practical with ease.  It almost makes me want to write a more complicated parser by hand, just to see how well Scala handles it.

I’m not entirely sure what I want to do with the result.  As I mentioned, Scala needs an in-language implementation of printf, so maybe I’ll flesh out the implementation some more and submit a patch.  The unfortunate problem with this is printf is more than just a parser.  We can’t just take our token stream and pipe it to stdout, hoping for an epsilon transition.  As it turns out, walking this token stream and formatting the substitutions proves to be a very ugly, very tedious task.  I’ve already implemented most of the core substitution functionality, but a lot of the more complicated stuff remains undone.  If anyone’s interested in the full sources + a BDD specification for printf, just let me know.  :-)

Parsing is a very interesting science with a world of representations and experiences to draw upon.  Even for simple grammars like printf, many lessons can be learned about the fundamentals of computing and just what constitutes a language.  And what better language to use in learning these lessons than Scala?

Update: Public interest seems to be high enough to merit uploading the full project.  You should be able to download using the link below (project managed with Buildr).

Dramatically Improved UI in jEdit

22 May 2008

This is definitely old news by now (in fact, almost a month old), but I’m just now discovering it myself so I decided to share.  The jEdit project is renowned for two things:

  • Marvelous support for every language under the sun
  • Eye-bleedingly bad UI design

It’s always been possible to hack yourself an improved version without too much trouble; but by default, jEdit has always looked terrible.  This one factor, more than anything else, has contributed to jEdit’s reputation as the supercharged editor which everyone refuses to try.  Fortunately, this influence has been seriously reduced in the 4.3pre14 release:

[Screenshot: jEdit 4.3pre14 with the refreshed interface]

Compare that to the old look.  Even with Java 6 subpixel rendering, the interface remained a mess.  What’s more, many of the interface elements were custom renderings, preventing the platform-native LAF from appropriately styling them (the toolbar controls are a prime example).  All of this is fixed in 4.3pre14.

jEdit is rapidly approaching “usable editor” status out of the box, something that even the mighty TextMate hasn’t quite achieved.  Granted, it’s still Swing-based, which means the fonts render horribly on Vista without Java 6uN, but it’s a step in the right direction.  Now, if only they would do something about their website…