Bencode Stream Parsing in Java

15 Jul 2008

It’s surprising how universal XML has become.  It doesn’t seem to matter what the problem, XML is the solution.  For example, consider a simple client/server architecture where the communication protocol must transmit some sort of structured data.  Nine developers out of ten will form the basis of the protocol around XML.  If it’s a lot of data to be transferred, then they will compress the XML using Java’s stream compression libraries.  If there’s binary data to be transmitted, it will either be stored as CDATA within the XML or as files within the same compressed archive.  Very few developers will actually stop and consider alternative solutions.

One such “alternative solution” is bencode (pronounced “bee-encode”).  Similar to formats like XML and JSON, bencode defines a series of constructs which may be used to encode arbitrarily complex data.  However, unlike XML, the design focus of the format was not to produce verbose, human-readable documents, but rather to encode data in the most concise manner possible.  To that end, the core bencode specification only includes four data types: two simple types (integers and byte strings) and two composite structures (lists and dictionaries).  These types are defined with an almost complete absence of metadata, requiring very little “structure” to clutter the data stream.
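To make the four types concrete, here is a minimal sketch of an in-memory encoder.  It is not the streaming library described in this article (BencodeSketch and encode are made-up names), and it assumes ASCII-only strings so that character counts equal byte counts:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative in-memory encoder; NOT the streaming framework from this article.
final class BencodeSketch {
    // Encodes Integer/Long, String, List and Map<String, ?> values (ASCII strings assumed).
    static String encode(Object value) {
        if (value instanceof Integer || value instanceof Long) {
            return "i" + value + "e";                      // integer: i<digits>e
        } else if (value instanceof String) {
            String s = (String) value;
            return s.length() + ":" + s;                   // string: <length>:<bytes>
        } else if (value instanceof List) {
            StringBuilder sb = new StringBuilder("l");     // list: l<items>e
            for (Object item : (List<?>) value) {
                sb.append(encode(item));
            }
            return sb.append('e').toString();
        } else if (value instanceof Map) {
            @SuppressWarnings("unchecked")
            Map<String, Object> map = (Map<String, Object>) value;
            StringBuilder sb = new StringBuilder("d");     // dictionary: d<key><value>...e
            // TreeMap iterates in sorted key order, which the format requires
            for (Map.Entry<String, Object> e : new TreeMap<>(map).entrySet()) {
                sb.append(encode(e.getKey())).append(encode(e.getValue()));
            }
            return sb.append('e').toString();
        }
        throw new IllegalArgumentException("Unsupported type: " + value.getClass());
    }

    public static void main(String[] args) {
        System.out.println(encode(Arrays.asList("Earth", 42)));   // prints l5:Earthi42ee
    }
}
```

Note how little framing there is: one byte to open an integer, list or dictionary, one byte to close it, and a decimal length in front of every string.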

Unfortunately, outside of applications like BitTorrent, this elegant binary format has seen remarkably little adoption.  Because of this state of affairs, it can be extremely difficult to find libraries to actually process bencode data.  Not too long ago, I ran into a production use-case which required both parsing and generation of bencode-formatted files.  I considered digging into the source code for Vuze (née “Azureus”), but a) it seemed like a lot of boring, nearly-wasted effort, and b) I strongly suspect that their bencode parser and generator are extremely space-inefficient, since the data sources which they deal with are remarkably small.

The second hang-up was really a more significant motivator than the first, due to the fact that I knew I would be dealing with bencode streams potentially gigabytes in size.  So, rather than fruitlessly dig through someone else’s code, I decided to put all of this formal parser theory to work and roll my own library.  Unless you’re already familiar with bencode, I suggest you read the Wikipedia article to get a feel for the format; otherwise, some of what I will be talking about will make no sense at all.  :-)

The first thing I needed to do was build the generation half of the library.  I decided that it would be easier if I avoided trying to use the same backend framework classes with both the generator and the parser.  For example, there are actually two classes in the framework which contain the logic for handling an integer: IntegerValue and IntegerType.  The former is for use in the parser, while the latter is for use in the generator.  This separation of logic may seem a little strange, but it actually simplifies things tremendously.

Remember my primary requirement: extremely efficient implementation of both generator and parser, especially with respect to space.  If I attempted to use the same classes to represent data for both the parser and the generator, then the parser would be forced to read the entire stream into some sort of in-memory representation (think about it; it’s actually true).  Obviously, this is unacceptable for streams that are gigabytes in size, so the traditional “good design” from an object-oriented standpoint was out.

Stream Generation

Since I needed the functionality of bencode stream generation before I needed parsing, I started with that aspect of the framework.  Here again, the most obvious “object-oriented” approach would have been the wrong one.  When we think of generating output in a structured format programmatically, we naturally imagine a DOM-like tree representation (preferably framework-agnostic) which is then walked by the framework to produce the output.  The major disadvantage to this approach is that it requires paging everything into memory.  This works for smaller applications or situations where the data is already in memory, but for my particular use-case, it would have been disastrous.

The only way to avoid paging everything into memory for stream generation is to structure the API so that the data is “pulled” by the generator, rather than “pushed” to it in tree-form.  In other words, the data itself has to be lazy-loaded, using callbacks to grab the data as-needed and hold it in memory only as long as is absolutely necessary.  In a functional language, this would be done with closures (or even normal data types in a pure-functional language).  However, as we all know, Java does not support such time-saving features.  The only recourse is to use abstract classes and interfaces which can be overridden in anonymous inner-classes as well as top-level classes as necessary.
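A minimal sketch of what such a “pulled” value might look like, assuming a string type with two callbacks.  LazyStringSketch and demo are hypothetical names, and the real framework classes may well differ in detail:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a lazily pulled string type; names are illustrative.
abstract class LazyStringSketch {
    // Called by the generator only when the length prefix is about to be written.
    protected abstract long getLength();

    // Called by the generator to stream the payload straight to the output.
    protected abstract void writeValue(OutputStream os) throws IOException;

    // Emits <length>:<payload>; the generator itself never buffers the payload.
    public final void write(OutputStream os) throws IOException {
        os.write(Long.toString(getLength()).getBytes(StandardCharsets.US_ASCII));
        os.write(':');
        writeValue(os);
    }

    // Convenience for the example: encode a small in-memory payload.
    static String demo(final byte[] payload) {
        LazyStringSketch value = new LazyStringSketch() {
            @Override
            protected long getLength() {
                return payload.length;
            }

            @Override
            protected void writeValue(OutputStream os) throws IOException {
                os.write(payload);    // a real source could copy from a file in chunks
            }
        };

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            value.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }
}
```

The anonymous subclass is the poor man’s closure: nothing is read from the payload until the generator actually asks for it.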

[Figure: class hierarchy of the generator types]

After a bit of experimentation, the finalized hierarchy looks something like this.  Logically, every type must be able to query its abstract method for data of a certain Java type (long for IntegerType, InputStream for StringType, etc), convert this data into bencode with the appropriate metadata, and then write the result to a given OutputStream.  Also following our nose, we see the semantic differences between composite and primitive types are really quite limited, especially if we simplify everything to a black box “get data / write encoding” methodology.  In fact, the only thing that CompositeType actually does is enforce the prefix/suffix encoding of every composite type.  Since this is in compliance with the bencode specification, we are safe in extracting this functionality into a superclass.
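The prefix/suffix idea can be sketched in a few lines.  This is only an approximation of what a class like CompositeType might do; CompositeSketch, writeContents and demo are names invented for the example:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical composite base class: it owns the delimiters, subclasses own the contents.
abstract class CompositeSketch {
    private final char prefix;
    private final char suffix;

    protected CompositeSketch(char prefix, char suffix) {
        this.prefix = prefix;
        this.suffix = suffix;
    }

    // Subclasses write their children here, one at a time.
    protected abstract void writeContents(OutputStream os) throws IOException;

    // The only job of the superclass: wrap the contents in prefix and suffix.
    public final void write(OutputStream os) throws IOException {
        os.write(prefix);
        writeContents(os);
        os.write(suffix);
    }

    // Example: a list ('l' ... 'e') holding the integers 1 and 2.
    static String demo() {
        CompositeSketch list = new CompositeSketch('l', 'e') {
            @Override
            protected void writeContents(OutputStream os) throws IOException {
                os.write("i1e".getBytes(StandardCharsets.US_ASCII));
                os.write("i2e".getBytes(StandardCharsets.US_ASCII));
            }
        };

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            list.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);   // "li1ei2ee"
    }
}
```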

The more interesting distinction is between so-called “variant” and “invariant” types.  This is where you should begin to notice that I have over-engineered this library to some degree.  If I was just trying to create a pure bencode generator, then I could have skipped InvariantPrimitiveType and VariantPrimitiveType and just let IntegerType and StringType extend PrimitiveType directly.  This comes back to my initial requirements.

Priority one was to create a framework which was blazingly fast, but priority two was to ensure that it was extensible at the type level.  For the particular application I was interested in, I required more than just the core bencode types.  Also on the agenda were proper UTF-8 strings, dates, and support for null.  To accommodate all of this without too much code duplication, I knew I would have to extract a lot of the functionality into generic superclasses.  Hence my somewhat incorrect use of the terms “variant” and “invariant” to describe the difference between the integer type - which is prefix/suffix delimited - and the string type - which defines a length as its prefix and has no closing suffix.

Anyway, back to the problem at hand.  In addition to the CompositeType and PrimitiveType, you should also notice EntryType.  This “extra” type exists to handle the fact that bencode dictionaries are extremely weird and sit rather outside the “common functionality” umbrella of the format in general.  For one thing, the specification requires that dictionary entries be sorted by key, obviously implying some sort of Comparable relation.  Moreover, these keys must be themselves strings, but StringType isn’t comparable because its writeValue(OutputStream) method doesn’t return the data in question, but merely writes it to a given OutputStream.  Are we starting to see the problems with space-efficient implementations?
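To illustrate the tension, here is a rough sketch in which only the key is held in memory and compared, leaving the sorting to a SortedSet.  EntrySketch is a hypothetical class, and the value is kept as pre-encoded text purely to keep the example short:

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical entry type: ordering is defined by the key alone.
final class EntrySketch implements Comparable<EntrySketch> {
    final String key;             // must be in memory, otherwise we cannot sort
    final String encodedValue;    // already-bencoded value (could be lazy in a real design)

    EntrySketch(String key, String encodedValue) {
        this.key = key;
        this.encodedValue = encodedValue;
    }

    @Override
    public int compareTo(EntrySketch other) {
        return key.compareTo(other.key);    // ascending key order
    }

    // Writes entries in whatever order the sorted set yields them.
    static String encodeDictionary(SortedSet<EntrySketch> entries) {
        StringBuilder sb = new StringBuilder("d");
        for (EntrySketch e : entries) {
            sb.append(e.key.length()).append(':').append(e.key).append(e.encodedValue);
        }
        return sb.append('e').toString();
    }

    static String demo() {
        SortedSet<EntrySketch> entries = new TreeSet<EntrySketch>();
        entries.add(new EntrySketch("number", "i42e"));            // added out of order
        entries.add(new EntrySketch("name", "11:Arthur Dent"));
        return encodeDictionary(entries);   // "d4:name11:Arthur Dent6:numberi42ee"
    }
}
```

The keys are small, so keeping them in memory is cheap; it is the values that have to stay lazy.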

Enough babble though, let’s see some code!  Here’s how we might encode some very simple data using the generator framework:

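import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.SortedSet;

// DictionaryType, EntryType, IntegerType, ListType, ListTypeStream and StringType
// come from the framework; import them from wherever its packages live.

public class GeneratorTest {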
    public static void main(String[] args) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        final byte[] picture = new byte[0];        // presumably something interesting
 
        DictionaryType root = new DictionaryType() {
            @Override
            protected void populate(SortedSet<EntryType<?>> entries) {
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("name"), 
                        new LiteralStringType("Arthur Dent")));
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("number"), 
                        new IntegerType(42)));
 
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("picture"), 
                        new StringType() {
 
                    @Override
                    protected long getLength() {
                        return picture.length;
                    }
 
                    @Override
                    protected void writeValue(OutputStream os) throws IOException {
                        os.write(picture);
                    }
                }));
 
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("planets"), 
                        new ListType() {
 
                    @Override
                    protected void populate(ListTypeStream list) throws IOException {
                        list.add(new LiteralStringType("Earth"));
                        list.add(new LiteralStringType("Somewhere else"));
                        list.add(new LiteralStringType("Old Earth"));
                    }
                }));
            }
        };
 
        try {
            root.write(os);
        } catch (IOException e) {
            e.printStackTrace();
        }
 
        System.out.println(new String(os.toByteArray()));
    }
 
    private static class LiteralStringType extends StringType 
            implements Comparable<LiteralStringType> {
        private final String value;
 
        public LiteralStringType(String value) {
            this.value = value;
        }
 
        @Override
        protected long getLength() {
            return value.length();
        }
 
        @Override
        protected void writeValue(OutputStream os) throws IOException {
            os.write(value.getBytes("US-ASCII"));
        }
 
        public int compareTo(LiteralStringType o) {
            return o.value.compareTo(value);
        }
    }
}

It’s hard to imagine why some people claim that Java is a verbose language…

The API may seem a little clumsy, but most of that is caused by the conniptions required to make the generator lazily pull the data, rather than paging it all into memory ahead of time.  Throwing that aside, the rest of the verbosity seems to come from the need for LiteralStringType, rather than just having a StringType which could handle this for us.  The reason for this extra headache is shown in the population of the “picture” field, which presumably may contain several megabytes worth of data from some external source such as a file or database (in this case of course, it doesn’t contain anything, but that’s beside the point).

The result of the above is as follows:

d4:name11:Arthur Dent6:numberi42e7:picture0:7:planetsl5:Earth14:Somewhere else9:Old Earthee

Or, with a little formatting to make it more palatable:

d
  4:name
  11:Arthur Dent

  6:number
  i42e

  7:picture
  0:

  7:planets
  l
    5:Earth
    14:Somewhere else
    9:Old Earth
  e
e

Technically, this is no longer valid bencode, but it is much easier to read this way.

The Parser

With all this bustle surrounding the generator, it’s easy to forget about the inverse process: parsing.  As it turns out, this is both easier and far less elegant than the solution for the generator (I know, it’s a sad state of affairs when the above is considered “elegant”).  Here again, there was a need for the parser to be extremely efficient, especially in terms of memory.  Thus, the logical approach of simply parsing the stream into an in-memory tree doesn’t really work.  Instead, the parser must be a so-called “pull parser”, which only parses each token upon request.  The parser does exactly the work you ask of it and nothing more.

My initial designs for the parser attempted to follow the example set by the generator: each value type self-contained, responsible for parsing its own format.  As it turns out, this can be difficult to accomplish.  I could have expanded slightly on the parser combinator concept, but monads are very clumsy to achieve in Java, which led me to rule out that option.  In the end, I took a middle ground.

[Figure: class hierarchy of the parser values]

As before, a common superinterface sits above the entire representative hierarchy.  To understand this hierarchy a little better, perhaps it would be helpful to look at the full source for Value:

public interface Value<T> {
    public T resolve() throws IOException;
    public boolean isResolved();
}

The resolve() method is really the core of the entire parser.  The concept is that each value will be able to consume the bytes necessary to determine its own value, which is converted and returned.  This is extremely convenient because it enables VariantValue(s) (such as string) to carry the logic for parsing to a specific length, rather than the conventional e terminator.  In order to avoid clogging up memory, the return value of resolve() should not be memoized (though, there is nothing in the framework to prevent it).  Conventionally, values which are already resolved should throw an exception if they are resolved a second time.  This prevents the framework from holding onto values which are no longer needed.
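As a rough illustration of the single-use convention, here is a sketch of a length-delimited value.  ValueSketch and StringValueSketch are hypothetical names, the length prefix and colon are assumed to have been consumed already, and the payload is buffered into a byte[] only to keep the example small (a real implementation would hand back a bounded stream instead):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Same shape as the Value interface above, renamed to keep the sketch self-contained.
interface ValueSketch<T> {
    T resolve() throws IOException;
    boolean isResolved();
}

// Hypothetical string value: consumes exactly `length` bytes, exactly once.
final class StringValueSketch implements ValueSketch<byte[]> {
    private final InputStream is;
    private final int length;
    private boolean resolved = false;

    StringValueSketch(InputStream is, int length) {
        this.is = is;
        this.length = length;
    }

    public byte[] resolve() throws IOException {
        if (resolved) {
            throw new IOException("Value already resolved");
        }
        resolved = true;

        byte[] data = new byte[length];
        int offset = 0;
        while (offset < length) {           // read() may return fewer bytes than requested
            int read = is.read(data, offset, length - offset);
            if (read < 0) {
                throw new IOException("Unexpected end of stream in string value");
            }
            offset += read;
        }
        return data;                        // not cached: the caller owns the result
    }

    public boolean isResolved() {
        return resolved;
    }

    // Resolves "spam" from the stream, then checks that a second resolve() is rejected.
    static String demo() {
        InputStream in = new ByteArrayInputStream("spami42e".getBytes(StandardCharsets.US_ASCII));
        StringValueSketch value = new StringValueSketch(in, 4);
        try {
            String first = new String(value.resolve(), StandardCharsets.US_ASCII);
            try {
                value.resolve();
            } catch (IOException expected) {
                return first;               // "spam"
            }
            throw new IllegalStateException("Second resolve() should have failed");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```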

You will also notice that CompositeValue not only inherits from Value, but also from the JDK interface, Iterable.  Logically, a composite value is a linear collection of values, consumed one at a time.  To me, that sounds a lot like a unidirectional iterator.  We can, of course, resolve the entire composite at once, mindlessly consuming all of its values, but since all of the values are lost once consumed, the only purpose for such an action would be if we know that we don’t care about a particular composite and we just want to rapidly skip to the next value in the stream.
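The one-shot iteration idea can be sketched as follows.  This is a deliberately simplified stand-in that only understands lists of integers; IntegerListSketch and sum are hypothetical names, and the opening l is assumed to have been consumed by the caller:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical composite: a bencoded list of integers, traversable exactly once.
final class IntegerListSketch implements Iterable<Long> {
    private static final int UNREAD = -2;
    private final InputStream is;
    private boolean iterating = false;

    IntegerListSketch(InputStream is) {
        this.is = is;
    }

    public Iterator<Long> iterator() {
        if (iterating) {
            throw new IllegalStateException("List can only be traversed once");
        }
        iterating = true;

        return new Iterator<Long>() {
            private int marker = UNREAD;    // next type byte, if already read

            public boolean hasNext() {
                if (marker == UNREAD) {
                    marker = readByte();
                }
                if (marker == 'i') return true;     // another integer follows
                if (marker == 'e') return false;    // end of the list
                throw new IllegalStateException("Expected 'i' or 'e' but got byte " + marker);
            }

            public Long next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                marker = UNREAD;

                boolean negative = false;
                long value = 0;
                int b;
                while ((b = readByte()) != 'e') {
                    if (b == '-') {
                        negative = true;
                    } else if (b >= '0' && b <= '9') {
                        value = (value * 10) + (b - '0');
                    } else {
                        throw new IllegalStateException("Unexpected byte in integer: " + b);
                    }
                }
                return negative ? -value : value;
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    private int readByte() {
        try {
            return is.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sums a list body such as "i1ei22ei-3ee" (everything after the leading 'l').
    static long sum(String listBody) {
        InputStream in = new ByteArrayInputStream(listBody.getBytes(StandardCharsets.US_ASCII));
        long total = 0;
        for (long v : new IntegerListSketch(in)) {
            total += v;     // each element is gone from the stream once consumed
        }
        return total;
    }
}
```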

Returning to primitive values, the resolve() method for IntegerValue is worthy of note, not so much for its uniqueness, but because it is very similar to the parsing technique used in all the other values:

public Long resolve() throws IOException {
    if (resolved) {
        throw new IOException("Value already resolved");
    }
    resolved = true;
 
    boolean negative = false;
    long value = 0;
 
    int b = 0;
    while ((b = is.read()) >= 0) {
        int digit = b - '0';
 
        if (digit < 0 || digit > 9) {
            if (b == '-') {
                negative = true;
            } else if (b == 'e') {
                break;
            } else {
                throw new IOException("Unexpected character in integer value: " 
                    + (char) b);
            }
        } else {
            value = (value * 10) + digit;
        }
    }
 
    if (negative) {
        value *= -1;
    }
 
    return value;
}

The i prefix itself is consumed before control flow even enters this method.  This is because the prefix is required to determine the appropriate value implementation to use.  Specifically, the logic to perform this determination is contained within the Parser class, which maintains a map of Value(s) and their associated prefixes.  String values have special logic associated with them, as they do not have a prefix.

As with most hand-coded parsers, this one operates on the principle of “eat until it hurts”.  We start out by assuming that the integer value extends to the end of the stream, then we set about to find a premature end to the integer, at which point we break out and call it a day.  Since we are moving from left to right through a base-10 integer, we must multiply the current accumulator by 10 prior to adding the new digit. 
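The accumulation step is easy to trace by hand.  The following small sketch (AccumulatorTrace is a made-up helper, not part of the framework) records the accumulator after each digit:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper: shows the accumulator after each digit is consumed.
final class AccumulatorTrace {
    static List<Long> trace(String digits) {
        List<Long> steps = new ArrayList<Long>();
        long value = 0;

        for (int i = 0; i < digits.length(); i++) {
            int digit = digits.charAt(i) - '0';
            if (digit < 0 || digit > 9) {
                throw new IllegalArgumentException("Not a digit: " + digits.charAt(i));
            }
            value = (value * 10) + digit;   // shift left one decimal place, then add
            steps.add(value);
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(trace("1337"));  // prints [1, 13, 133, 1337]
    }
}
```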

Actually, the real heart of the parser framework is CompositeValue.  This class is inherited by Parser to define a special value encompassing the stream itself (which is viewed as a composite value with no delimiters and only a single child).  This unification allows us to keep the code for parsing a composite stream in a single location.  This implementation is a little less concise than the code for parsing an integer, but it follows the same pattern and is fairly instructive:

protected final Value<?> parse() throws IOException {
    if (resolved) {
        throw new IOException("Composite value already resolved");
    }
 
    if (previous != null) {
        if (!previous.isResolved()) {
            previous.resolve();        // ensure we're at the right spot in the stream
        }
    }
 
    byte b = -1;
    if (readAhead instanceof Some) {
        b = readAhead.value();
        readAhead = new None<Byte>();
    } else {
        b = read();
    }
 
    if (b >= 0) {
        Class<? extends Value<?>> valueType = parser.getValueType(b);
 
        if (valueType != null) {
            return previous = Parser.createValue(valueType, parser, is);
        } else if (b >= '0' && b <= '9') {
            return previous = readString(b - '0');
        } else if (b == ' ' || b == '\n' || b == '\r' || b == '\t') {
            return parse();        // loop state
        } else {
            throw new IOException("Unexpected character in the parse stream: " 
                + (char) b);
        }
    }
 
    throw new IOException("Unexpected end of stream in composite value");
}
 
private final StringValue readString(long length) throws IOException {
    int i = is.read();
 
    if (i >= 0) {
        byte b = (byte) i;
 
        if (b == ':') {
            return Parser.createValue(StringValue.class, parser, 
                new SubStream(is, length));
        } else if (b >= '0' && b <= '9') {
            return readString((length * 10) + b - '0');
        } else {
            throw new IOException("Unexpected character in string value: " 
                + (char) i);
        }
    }
 
    throw new IOException("Unexpected end of stream in string value");
}

It seems a bit imposing, but really this code is more of the same logic we saw previously when dealing with integers.  The only value type which really gives us trouble here is string.  We can’t simply treat it like the others because it has no prefix.  For this reason, we must assume that any unbound integer is an inclusive prefix for a string.  In most parser implementations, this would require backtracking, but because we are doing this by hand, we can condense the backtrack into an inherited parameter (borrowing terminology from attribute grammars), avoiding the performance hit.
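The same idea can be written as a loop instead of a recursion.  This is a simplified sketch rather than the framework’s code: StringPrefixSketch is a hypothetical name, and the payload is buffered into memory here, which the real parser specifically avoids:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative only: carries the already-consumed first digit forward instead of backtracking.
final class StringPrefixSketch {
    // firstDigit is the byte the caller read while looking for a type prefix.
    static String readString(int firstDigit, InputStream is) throws IOException {
        long length = firstDigit - '0';
        int b;

        while ((b = is.read()) != ':') {
            if (b < 0) {
                throw new IOException("Unexpected end of stream in string value");
            }
            if (b < '0' || b > '9') {
                throw new IOException("Unexpected character in string value: " + (char) b);
            }
            length = (length * 10) + (b - '0');
        }

        byte[] data = new byte[(int) length];       // sketch only: buffers the payload
        new DataInputStream(is).readFully(data);    // throws EOFException if the stream is short
        return new String(data, StandardCharsets.US_ASCII);
    }

    static String demo(String encoded) {
        InputStream in = new ByteArrayInputStream(encoded.getBytes(StandardCharsets.US_ASCII));
        try {
            return readString(in.read(), in);       // the first byte plays the "prefix" role
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that a first digit of 0 has to be accepted, since 0: is how an empty string is encoded.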

There’s one final bit of weirdness which deserves attention before we bail on this small epic: dictionary values.  Intuitively, a dictionary value should be parsed into a Java Map, or some sort of associative data structure.  Unfortunately, a map is by definition a random access data structure.  Since we are dealing with a sequential bencode stream, the only recourse to satisfy this property would be to page the entire dictionary into memory.  This of course violates one of the primary requirements which is to avoid using more memory than necessary.

The solution I eventually chose to this problem was to limit dictionary access to sequential, which translates into alphabetical given the nature of bencode dictionaries.  Thus, a dictionary can be parsed in the same way as a list, where each element is a sequential key and value, jointly represented by EntryValue.  To make usage patterns slightly easier, EntryValue memoizes the key and value.  Due to the fact that both of these objects are themselves Value(s), this does not lead to inadvertent memory bloat.

Conclusion

Hopefully the parser and generator presented here will be of some utility in situations where you have to parse large volumes of bencoded data.  The API is (admittedly) bizarre and difficult to deal with, but the performance results are difficult to deny.  This framework is currently deployed in production, where benchmarks have shown that it imposes little-to-no runtime overhead, and practically zero memory overhead (despite the sizeable amounts of data being processed).

For convenience, I actually created a Google Code project for this framework so as to facilitate its development internally to the project I was working on.  The end result is that, unlike most of my experiments, there is actually a proper SVN repository from which the source may be obtained!  A packaged JAR may be obtained from the downloads section.

Implementing Groovy’s Elvis Operator in Scala

7 Jul 2008

Groovy has an interesting shortening of the ternary operator that it rather fancifully titles “the Elvis Operator”.  This operator is hardly unique to Groovy - C# has had it since 2.0 in the form of the Null Coalescing Operator - but that doesn’t mean that it is not a language feature worth learning from.  Surprisingly (for a C-derivative language), Scala entirely lacks any sort of ternary operator.  However, the language syntax is more than flexible enough to implement something similar without ever having to dip into the compiler.

But before we go there, it is worth examining what this operator does and how it works in languages which already have it.  In essence, it is just a bit of syntax sugar, allowing you to easily check if a value is null and provide a fallback value in the case that it is.  For example:

firstName = "Daniel"
lastName = null
 
println firstName ?: "Chris"
println lastName ?: "Spiewak"

This profound snippet really demonstrates about all there is to the Elvis operator.  The result is as follows:

Daniel
Spiewak

Not terribly exciting.  Essentially, what we have is a binary operator which evaluates the left expression and tests to see if it is null.  In the case of firstName, this is false, so the right expression (in this case, "Chris") is never evaluated.  However, lastName is null, which means that we have to evaluate the right expression and return its value, rather than null.  It’s all just so much syntax sugar that can be expressed equivalently in any language with a conditional operator (in this case, Java):

String firstName = "Daniel";
String lastName = null;
 
System.out.println((firstName == null) ? "Chris" : firstName);
System.out.println((lastName == null) ? "Spiewak" : lastName);

A bit verbose, don’t you think?  Of course, this isn’t really a fair comparison, since Groovy is a far more concise language than Java.  Let’s see how the above would render in a real man’s language like Scala:

val firstName = "Daniel"
val lastName: String = null
 
println(if (firstName == null) "Chris" else firstName)
println(if (lastName == null) "Spiewak" else lastName)

Better, but still a little clumsy.  The truth of the matter is that we’re forced to do this sort of null checking all the time (well, maybe a little less in Scala) and the constructs for doing so are woefully inadequate.  Thus, the motivation for the Elvis operator.

Getting Things Started

Like all good programmers should, we’re going to start with a runnable specification for every behavior desired from the operator.  I’ve written before about the excellent Specs framework, so that’s what we’ll use:

"elvis operator" should {
  "use predicate when not null" in {
    "success" ?: "failure" mustEqual "success"
  }
 
  "use alternative when null" in {
    val test: String = null
    test ?: "success" mustEqual "success"
  }
 
  "type correctly" in {		// if it compiles, then we're fine
    val str: String = "success" ?: "failure"
    val i: Int = 123 ?: 321
 
    str mustEqual "success"
    i mustEqual 123
  }
 
  "infer join of types" in {    // must compile
    val res: CharSequence = "success" ?: new java.lang.StringBuilder("failure")	
    res mustEqual "success"
  }
 
  "only eval alternative when null" in {
    var a = "success"
    def alt = {
      a = "failure"
      a
    }
 
    "non-null" ?: alt
    a mustEqual "success"
  }
}

Fairly straightforward stuff.  I imagine that this specification for the operator is a bit more involved than the one used in the Groovy compiler, due to the fact that Scala is a statically typed language and thus requires a bit more effort to ensure that everything is working properly.  From this specification, we can infer three core properties of the operator:

  1. Basic behavior when null/not-null
  2. The result type should be the unification of the static types of the left and right operands
  3. The right operand should only be evaluated when the left is null

The first property is fairly easy to understand; it is intuitive in the definition of the operator.  All this means is that the value of the operator expression is dependent on the value of the left operand.  When not null, the expression value is equal to the value of the left operand.  If the left operand is null, then the expression is valued equivalent to the right operand.  This is just formally expressing what we spent the first section of the article describing.

Ignoring the second and third properties, we can actually attempt an implementation.  For the moment, we will just assume that the left and right operands must be of exactly the same type, otherwise the operator will be inapplicable.  So, without further ado, implementation enters stage right:

implicit def elvisOperator[T](alt: T) = new {
  def ?:(pred: T) = if (pred == null) alt else pred
}

Notice the use of the anonymous inner class to carry the actual operator?  This is a fairly common trick in Scala to avoid the definition of a full-blown class just for the sake of adding a method to an existing type.  To break down what’s going on here, we have defined an implicit type conversion from any type T to our anonymous inner class.  This conversion will be inserted by the compiler whenever we invoke the ?: operator on an expression.

Sharp-eyed developers will notice something a little odd about the way this code is structured.  In fact, if you look closely, it seems that we evaluate the right operand and use its value if non-null (otherwise left), which is exactly the opposite of what our specification defines.  For a normal operator, this observation would be quite correct.  However, Scala defines the associativity of operators based on the trailing symbol.  In this case, because our trailing symbol is a colon (:), the operator itself will be right-associative.  Thus, the following expression:

check ?: alternate

…is transformed by the compiler into the following:

alternate.?:(check)

This is how right-associative operators function, by performing method calls on the right operand.  Thus, we need to define our implicit conversion such that the ?: method will be defined for the right operand, taking the left operand as a parameter.  We’ll see a bit later on how this can cause trouble, but for now, let’s continue with the specification.

A Little Type Theory

The second property is a little tougher.  Type unification is one of those pesky issues that plague statically typed languages and are simply irrelevant in those with dynamic type systems.  The issue arises from the following question: what happens if the left and right operands are of different types?  In Groovy, this is a non-issue because the value of the expression is simply dynamically typed according to the runtime type of the operand which is chosen.  However, Scala requires static type information, which means that we need to ensure that the static type of the expression is sound for either the left or the right operand (since Scala does not have non-nullable types).  The best way to do this is to compute the least upper bound of the two types, an operation which is also known as minimal unification.  Consider the following hierarchy:

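[Figure: example type hierarchy (Fruit with subclasses Apple and Pear; Vegetable in a disjoint hierarchy)]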

Now imagine that the left operand is of static type Apple, while the right operand is of static type Pear.  We need to find a static type which is safe for both of these.  Intuitively, this type would be Fruit, since it is a common superclass of both Apple and Pear.  Regardless of which expression is chosen at runtime, we will be able to polymorphically treat the value as a value of type Fruit.  The intuition in this case is quite correct.  In fact, it actually has a rigorous mathematical proof…which I won’t go into.  (cue sighs of relief)

One additional example should serve to really drive the point home.  Consider the scenario where the left operand has type Vegetable and the right operand has type Apple.  This is a bit trickier, but it recursively boils down to the same case.  The only common superclass between these two types is Object, due to the fact that the hierarchies are disjoint.

This operation is fairly easy to perform by hand given the full type hierarchy.  For that matter, it isn’t very difficult to write an algorithm which can efficiently compute the minimal unification of two types.  Unfortunately, we don’t have that luxury here.  We cannot simply write code which is executed at compile time to determine type information, we must make use of the existing Scala type system in order to “trick” the compiler into inferring things for us.  We do this by making use of lower-bounds on type parameters.  With this in mind, we can (finally) make a first attempt at a well-typed implementation of the operator:

implicit def elvisOperator[T](alt: T) = new {
  def ?:[A >: T](pred: A) = if (pred == null) alt else pred
}

The only thing we have changed is the type of the pred variable from T to a new type parameter, A.  This new type parameter is defined by the lower-bound T.  Translated into English, the type expression reads something like the following:

Accept parameter pred of some type A which is a super-type of T.

The real magic of the expression is that pred need not be exactly of type A; it could also be a subtype.  Thus, A is some generic supertype which encompasses both the types of the left and the right operands.

Fancy Parameter Types

This allows us to move on to the third property: only evaluate the right operand if the left is null.  This is the normal behavior for conditional expressions.  After all, you wouldn’t want your code performing an expensive operation (such as grabbing data from a server somewhere) just to throw away the result because a different branch of the conditional was chosen.  Actually, the bigger issue with ignoring this property (as we have done so far) is that the right operand may actually have side-effects.  Scala isn’t a pure functional language, so evaluating expressions that we don’t need (or worse, that the developer isn’t expecting) can have extremely dire consequences.

Unfortunately, at first glance, there doesn’t really seem to be a way to avoid this evaluation.  After all, we need to invoke the ?: method on something.  We could try using a left-associative operator instead (such as C#’s ?? operator), but even that wouldn’t fully solve the problem as we would still need to pass the right operand as a parameter.  In short, it seems like we’re stuck.

The good news is that Scala’s designers chose to adopt an age-old construct known as “pass-by-name parameters”.  This technique dates all the way back to ALGOL (possibly even further).  In fact, it’s so old and obscure that I’ve actually had professors tell me that it has been completely abandoned in favor of the more conventional pass-by-value (what Java, C#, Scala and most languages use) and pass-by-reference (which is available in C++).  Pass-by-name parameters are very much like normal parameters in that they are used to copy values from a calling scope into the method in question.  However, unlike normal parameters, they are evaluated on an as-needed basis.  This means that a pass-by-name parameter will only be evaluated if its value is required within the method called.  For example:

def doSomething(a: =>Int) = 1 + 2
def createInteger() = {
  println("Made integer")
  42
}
 
println("In the beginning...")
doSomething(createInteger())
println("...at the end")

Counter to our first intuition, this will print the following:

In the beginning...
...at the end

In other words, the createInteger method is never called!  This is because the value of the pass-by-name parameter in the doSomething method is never accessed, meaning that the value of the expression is not needed.  The a parameter is denoted pass-by-name by the => notation (just in case you were wondering).  We can apply this to our implementation by changing the parameter of the implicit conversion from pass-by-value to pass-by-name:

implicit def elvisOperator[T](alt: =>T) = new {
  def ?:[A >: T](pred: A) = if (pred == null) alt else pred
}

The language-level implementation of the if/else conditional expression will ensure that the alt parameter is accessed if and only if the value of pred is null, meaning we have finally satisfied all three properties.  We can check this by compiling and running our specification from earlier:

Specification "TernarySpecs"
  elvis operator should
  + use predicate when not null
  + use alternative when null
  + type correctly
  + infer join of types
  + only eval alternative when null

Total for specification "TernarySpecs":
Finished in 0 second, 78 ms
5 examples, 6 assertions, 0 failure, 0 error
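The laziness property can also be checked by hand, without a specification framework.  What follows is only a minimal sketch: orElse, expensiveDefault and the evaluations counter are hypothetical names invented for illustration, and the implicit conversion is replaced by a plain method so that the by-name behavior is the only thing on display.

```scala
object ElvisLaziness {
  // Counts how many times the fallback expression is actually evaluated.
  var evaluations = 0

  // Hypothetical stand-in for an expensive or side-effecting operation.
  def expensiveDefault(): String = {
    evaluations += 1
    "default"
  }

  // `alt` is pass-by-name (note the =>), so its expression runs only
  // when the null branch of the conditional is actually taken.
  def orElse[A](pred: A)(alt: => A): A =
    if (pred == null) alt else pred

  def main(args: Array[String]): Unit = {
    val kept = orElse("value")(expensiveDefault())
    println(kept + ", evaluations so far: " + evaluations)

    val replaced = orElse(null: String)(expensiveDefault())
    println(replaced + ", evaluations so far: " + evaluations)
  }
}
```

The first call never touches expensiveDefault, so the counter stays at zero; only the second call, where the left operand is null, bumps it to one.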

Conclusion

We now have a working implementation of Groovy’s Elvis operator within Scala and we never had to move beyond simple API design.  Truly, one of Scala’s greatest strengths is its ability to express extremely complex constructs within the confines of the language.  This makes it uniquely well-suited to hosting internal domain-specific languages.  Using techniques similar to the ones I have outlined in this article, it is possible to define operations which would require compiler-level implementation in most languages.

The full source (such as it is) for the Elvis operator in Scala is available for download, along with a bonus implementation of C#’s ?? syntax (just in case you prefer it).  The implementation differs slightly due to the fact that ?? is a left-associative operator, but the single-use (unchained) semantics are identical.  Enjoy!

Formal Language Processing in Scala

16
Jun
2008

Quite some time ago, a smart cookie named Philip Wadler authored a publication explaining the concept of “parser combinators”, a method of representing the well-understood concept of text parsing as the composition of atomic constructs which behaved according to monadic law.  This idea understandably captured the imaginations of a number of leading researchers, eventually developing into the Haskell parsec library.  Being a functional language with academic roots, it is understandable that Scala would have an implementation of the combinator concept included in its standard library.

The inclusion of this framework into Scala’s core library has brought advanced text parsing to within the reach of the “common man” perhaps for the first time.  Of course, tools like Bison, ANTLR and JavaCC have been around for a long time and gained quite a following, but such tools often have a steep learning curve and are quite intimidating for the casual experimenter.  Parser combinators are often much easier to work with and can streamline the transition from “experimenter” to “compiler hacker”.

Of course, all of this functional goodness does come at a price: flexibility.  The Scala implementation of parser combinators rules out any left-recursive grammars.  Right recursion is supported (thanks to call-by-name parameters), but any production rule which is left-recursive creates an infinitely recursive call chain and overflows the stack.  It is possible to overcome this limitation with right-associative operators, but even if such an implementation existed in Scala, it wouldn’t do any good.  Scala’s parser combinators effectively produce an LL(*) parser instead of the far more flexible LR (or its more compact variant, LALR) such as would be produced by Bison or SableCC.  It is unknown to me (and everyone I’ve asked) whether or not it is even possible to produce an LR parser combinatorially, though the problem of left-recursion in LL(*) has been studied at length.

The good news is that you really don’t need all of the flexibility of an LR parser for most cases (in fact, you don’t theoretically need LR at all; it’s just easier for a lot of things).  Parser combinators are capable of satisfying many of the common parsing scenarios which face the average developer.  Unless you’re planning on building the next Java-killing scripting language, the framework should be just fine.

Of course, any time you talk about language parsing, the topic of language analysis and interpretation is bound to come up.  This article explores the construction of a simple interpreter for a trivial language.  I chose to create an interpreter rather than a compiler mainly for simplicity (I didn’t want to deal with bytecode generation in a blog post).  I also steer clear of a lot of the more knotty issues associated with semantic analysis, such as type checking, object-orientation, stack frames and the like.

To be honest, I’ve been looking forward to writing this article ever since I read Debasish Ghosh’s excellent introduction to external DSL implementation in Scala.  Scala’s parser combinators are dreadfully under-documented, especially when you discount purely academic publications.  Hopefully, this article will help to rectify the situation in some small way.

Simpletalk: Iteration 1

Before we implement the interpreter, it is usually nice to have a language to interpret.  For the purposes of this article, we will be using an extremely contrived language based around the standard output (hence the name: “simple” “talk”).  The language isn’t complete (even in the later iterations) and you would be hard-pressed to find any useful application, but it makes for a convenient running example to follow.

In the first iteration, Simpletalk is based around two commands: print and space.  Two hard-coded messages are available for output via print: HELLO and GOODBYE.  We will increase the complexity of the language later, allowing for literals and alternative constructs, but for the moment, this will suffice.  An example Simpletalk program which exercises all language features could be as follows:

print HELLO
space
space
print GOODBYE
space
print HELLO
print GOODBYE

The output would be the following:

Hello, World!

Farewell, sweet petunia!

Hello, World!
Farewell, sweet petunia!

As I said, not very useful.

The first thing we need to do is to define a context-free grammar using the combinator library included with Scala.  The language itself is simple enough that we could write the parser by hand, but it is easier to extend a language with a declarative definition.  Besides, I wanted an article on using parser combinators to build an interpreter…

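import scala.io.Source
import scala.util.parsing.combinator.syntactical.StandardTokenParsers
 
object Simpletalk extends StandardTokenParsers with Application {
  // keywords must be reserved so the lexer does not treat them as identifiers
  lexical.reserved += ("print", "space", "HELLO", "GOODBYE")
 
  // read the whole script into a single newline-separated String
  val input = Source.fromFile("input.talk").getLines.reduceLeft[String](_ + '\n' + _)
  val tokens = new lexical.Scanner(input)
 
  // phrase(...) requires the parser to consume the entire input
  val result = phrase(program)(tokens)
 
  // grammar starts here
  def program = stmt+
 
  def stmt = ( "print" ~ greeting
             | "space" )
 
  def greeting = ( "HELLO"
                 | "GOODBYE" )
}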

All of this is fairly standard EBNF notation.  The one critical bit of syntax here is the tilde (~) operator, shown separating the "print" and greeting tokens.  This method is the concatenation operator for the parsers.  Literally it means: first "print", then parse greeting, whatever that entails.  What would normally be non-terminals in a context-free grammar are represented by full Scala methods.  Type inference makes the syntax extremely concise.

It’s also worth noticing the use of the postfix plus operator (+) to specify a repetition with one or more occurrences.  This is standard EBNF and implemented as perfectly valid Scala within the combinator library.  We could have also used the rep1(...) method, but I prefer the operator simply because it is more notational.

Looking back up toward the top of the Simpletalk singleton, we see the use of lexical.reserved.  We must define the keywords used by our language to avoid the parser marking them as identifiers.  To save time and effort, we’re going to implement the interpreter by extending the StandardTokenParsers class.  This is nice because we get a lot of functionality for free (such as parsing of string literals), but we also have to make sure that our language is relatively Scala-like in its syntax.  In this case, that is not a problem.
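To see why the reserved list matters, here is a toy sketch of the decision a lexer has to make for each word it reads.  ToyLexer, its Token classes and classify are all hypothetical names made up for this illustration; the real scanner behind StandardTokenParsers does considerably more (string literals, delimiters, comments and so on).

```scala
object ToyLexer {
  sealed abstract class Token
  case class Keyword(chars: String) extends Token
  case class Identifier(chars: String) extends Token

  // Hypothetical stand-in for the contents of lexical.reserved.
  val reserved = Set("print", "space", "HELLO", "GOODBYE")

  // A word is a keyword only if it was explicitly reserved;
  // everything else falls through to being an identifier.
  def classify(word: String): Token =
    if (reserved.contains(word)) Keyword(word) else Identifier(word)

  def main(args: Array[String]): Unit =
    "print HELLO errorHere".split(" ").map(classify).foreach(println)
}
```

Under this scheme, forgetting to reserve "print" would turn it into an Identifier, and a grammar rule expecting the keyword "print" would never match.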

Down the line, we initialize the scanner and use it to parse a result from the hard-coded file, input.talk.  It would seem that even if we wanted to use Simpletalk for some useful application, we would have to ensure that the entire script was contained in a single, rigidly defined file relative to the interpreter.

Defining the AST

Once the grammar has been created, we must create the classes which will define the abstract syntax tree.  Almost any language defines a grammar which can be logically represented as a tree.  This tree structure is desirable as it is far easier to work with than the raw token stream.  The tree nodes may be manipulated at a high level in the interpreter, allowing us to implement advanced features like name resolution (which we handle in iteration 3).

At the root of our AST is the statement, or rather, a List of statements.  Currently, the only statements we need to be concerned with are print and space, so we will only need to define two classes to represent them in the AST.  These classes will extend the abstract superclass, Statement.

The Space class will be fairly simple, as the command requires no arguments.  However, Print will need to contain some high-level representation of its greeting.  Since there are only two greetings available and as they are hard-coded into the language, we can safely represent them with separate classes extending a common superclass, Greeting.  The full hierarchy looks like this:

sealed abstract class Statement
 
case class Print(greeting: Greeting) extends Statement
case class Space() extends Statement
 
sealed abstract class Greeting {
  val text: String
}
 
case class Hello() extends Greeting {
  override val text = "Hello, World!"
}
 
case class Goodbye() extends Greeting {
  override val text = "Farewell, sweet petunia!"
}

Believe it or not, this is all we need to represent the language as it stands in tree form.  However, the combinator library does not simply look through the classpath and guess which classes might represent AST nodes; we must explicitly tell it how to convert the results of each parse into a node.  This is done using the ^^ and ^^^ methods:

def program = stmt+
 
def stmt = ( "print" ~ greeting ^^ { case _ ~ g => Print(g) }
           | "space" ^^^ Space() )
 
def greeting = ( "HELLO" ^^^ Hello()
               | "GOODBYE" ^^^ Goodbye() )

The ^^^ method takes a parameter as a literal value which will be returned if the parse is successful.  Thus, any time the space command is parsed, the result will be an instance of the Space class, defined here.  This is nicely compact and efficient, but it does not satisfy all cases.  The print command, for example, takes a greeting argument.  To allow for this, we use the ^^ method and pass it a function, written here as a pattern-matching block.  The block defines a pattern which is matched against the parse result.  If the match succeeds, then the inner expression is resolved (in this case, Print(g)) and the result is returned.  Since the Parser defined by the greeting method is already defined to return an instance of Greeting, we can safely pass the result of this parse as a parameter to the Print constructor.  Note that we need not define any node initialization for the program non-terminal as the + operator is already defined to return a list of whatever type it encapsulates (in this case, Statement).
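If the operators still feel like magic, it may help to see how little machinery is needed to imitate them.  The following is a toy sketch, not the library's implementation: the P class, keyword and rep1 are invented for this illustration, parsers work on a pre-split list of words rather than a real token stream, results of ~ are plain tuples instead of the library's ~ case class, and the greeting is stored as a String to keep things short.

```scala
object MiniCombinators {
  // A parser takes the remaining tokens and either fails (None) or
  // succeeds with a result plus whatever tokens are left over.
  case class P[+A](run: List[String] => Option[(A, List[String])]) {
    // Sequencing: run this parser, then `next` on the leftover tokens.
    def ~[B](next: => P[B]): P[(A, B)] = P[(A, B)] { in =>
      run(in).flatMap { case (a, rest) =>
        next.run(rest).map { case (b, rest2) => ((a, b), rest2) }
      }
    }

    // Alternation: try `other` on the same input if this parser fails.
    def |[B >: A](other: => P[B]): P[B] =
      P[B](in => run(in).orElse(other.run(in)))

    // Transform a successful result (the role played by ^^).
    def ^^[B](f: A => B): P[B] =
      P[B](in => run(in).map { case (a, rest) => (f(a), rest) })

    // Replace a successful result with a fixed value (the role of ^^^).
    def ^^^[B](value: => B): P[B] = this ^^ (_ => value)

    // One or more repetitions, collected into a List (the role of +).
    def rep1: P[List[A]] = P[List[A]] { in =>
      run(in).map { case (first, rest) =>
        rep1.run(rest) match {
          case Some((more, rest2)) => (first :: more, rest2)
          case None                => (List(first), rest)
        }
      }
    }
  }

  sealed abstract class Statement
  case class Print(greeting: String) extends Statement
  case class Space() extends Statement

  // Succeeds only if the next token is exactly the expected word.
  def keyword(k: String): P[String] = P[String] {
    case `k` :: rest => Some((k, rest))
    case _           => None
  }

  def greeting: P[String] =
    ( keyword("HELLO") ^^^ "Hello, World!"
    | keyword("GOODBYE") ^^^ "Farewell, sweet petunia!" )

  def stmt: P[Statement] =
    ( keyword("print") ~ greeting ^^ { case (_, g) => Print(g) }
    | keyword("space") ^^^ Space() )

  def program: P[List[Statement]] = stmt.rep1

  def main(args: Array[String]): Unit = {
    val tokens = "print HELLO space print GOODBYE".split(" ").toList
    println(program.run(tokens))
  }
}
```

Because ~ binds more tightly than ^^, which in turn binds more tightly than |, the grammar reads the same way as the real one without extra parentheses.  Note that this sketch has no equivalent of phrase, so a program followed by garbage still "succeeds" with the garbage left over in the remaining tokens.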

The Interpreter

So far, we have been focused exclusively on the front side of the interpreter: input parsing.  Our parser is now capable of consuming and checking the textual statements from input.talk and producing a corresponding AST.  We must now write the code which walks the AST and executes each statement in turn.  The result is a fairly straightforward recursive deconstruction of a list, with each node corresponding to an invocation of println.

class Interpreter(tree: List[Statement]) {
  def run() {
    walkTree(tree)
  }
 
  private def walkTree(tree: List[Statement]) {
    tree match {
      case Print(greeting) :: rest => {
        println(greeting.text)
        walkTree(rest)
      }
 
      case Space() :: rest => {
        println()
        walkTree(rest)
      }
 
      case Nil => ()
    }
  }
}

This is where all that work we did constructing the AST begins to pay off.  We don’t even have to manually resolve the greeting constants; that can be handled polymorphically within the nodes themselves.  Actually, in a real interpreter, it probably would be best to let the statement nodes handle the actual execution logic, thus enabling the interpreter to merely function as a dispatch core, remaining relatively agnostic of language semantics.
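For the curious, the "dispatch core" alternative might look something like the sketch below.  This is not the design used in the rest of the article; the execute method and the out parameter are hypothetical, and Print carries its text directly rather than a Greeting so that the example stays self-contained.

```scala
object PolymorphicNodes {
  sealed abstract class Statement {
    // Each node knows how to execute itself against an output sink.
    def execute(out: String => Unit): Unit
  }

  case class Print(text: String) extends Statement {
    def execute(out: String => Unit): Unit = out(text)
  }

  case class Space() extends Statement {
    def execute(out: String => Unit): Unit = out("")
  }

  // The interpreter shrinks to a loop that knows nothing about
  // what any particular statement actually does.
  def run(tree: List[Statement], out: String => Unit): Unit =
    tree.foreach(_.execute(out))

  def main(args: Array[String]): Unit =
    run(List(Print("Hello, World!"), Space(), Print("Farewell, sweet petunia!")), println)
}
```

The trade-off is that nodes which need to run other statements no longer have the interpreter's own tree-walking method within easy reach.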

The final piece we need to tie this all together is a bit of logic handling the parse result (assuming it is successful) and transferring it to the interpreter.  We can accomplish this using pattern matching in the Simpletalk application:

result match {
  case Success(tree, _) => new Interpreter(tree).run()
 
  case e: NoSuccess => {
    Console.err.println(e)
    exit(100)
  }
}

We could get a bit fancier with our error handling, but in this case it is easiest just to print the error and give up.  The error handling we have is sufficient for our experimental needs.  We can test this with the following input:

print errorHere
space

The result is the following TeX-like trace:

[1.7] failure: ``GOODBYE'' expected but identifier errorHere found

print errorHere

      ^

One of the advantages of LL parsing is the parser can automatically generate relatively accurate error messages, just by inspecting the grammar and comparing it to the input.  This is much more difficult with an LR parser, which allows the parse process to consume multiple production rules simultaneously.

Pat yourself on the back!  This is all that is required to wire up a very simple language interpreter.  The full source is linked at the bottom of the article.

Iteration 2

Now that we have a working language implementation, it’s time to expand upon it.  After all, we can’t leave things working for long, can we?  For iteration 2, we will add the ability to print arbitrary messages using string and numeric literals, as well as a simple loop construct.  This loop will demonstrate some of the real merits of representing the program as a tree rather than a simple token stream.  As for syntax, we will define an example of these features as follows:

print HELLO
print 42

space
repeat 10
  print GOODBYE
next

space
print "Adios!"

The result should be the following:

Hello, World!
42

Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!

Adios!

The first thing we will need to do to support these new features is update our grammar.  This is where we start to see the advantages of using a declarative API like Scala’s combinators as opposed to creating the parser by hand.  We will need to change the stmt non-terminal (to accept the repeat command) as well as the greeting non-terminal (to allow string and numeric literals):

def stmt: Parser[Statement] = ( "print" ~ greeting ^^ { case _ ~ g => Print(g) }
                              | "space" ^^^ Space()
                              | "repeat" ~ numericLit ~ (stmt+) ~ "next" ^^ {
                                  case _ ~ times ~ stmts ~ _ => Repeat(times.toInt, stmts)
                                } )
 
def greeting = ( "HELLO" ^^^ Hello()
               | "GOODBYE" ^^^ Goodbye()
               | stringLit ^^ { case s => Literal(s) }
               | numericLit ^^ { case s => Literal(s) } )

Notice that we can no longer rely upon type inference in the stmt method, as it is now recursive.  This recursion is in the repeat rule, which contains a one-or-more repetition of stmt.  This is logical since we want repeat to contain other statements, including other instances of repeat.  The repeat rule also makes use of the numericLit terminal.  This is a rule which is defined for us as part of the StandardTokenParsers.  Technically, it is more accepting than we want since it will also allow decimals.  However, we don’t need to worry about such trivialities.  After all, this is just an experiment, right?

The numericLit and stringLit terminals are used again in two of the productions for greeting.  Both of these parsers resolve to instances of String, which we pattern match and encapsulate within a new AST node: Literal.  It makes sense for numericLit to resolve to String because Scala has no way of knowing how our specific language will handle numbers.

These are the new AST classes required to satisfy the language changes:

case class Repeat(times: Int, stmts: List[Statement]) extends Statement
 
case class Literal(override val text: String) extends Greeting

Literal merely needs to encapsulate a String resolved directly out of the parse; there is no processing required.  Repeat, on the other hand, is a bit more interesting.  This class contains a list of Statement(s), as well as an iteration count, which will define the behavior of the repeat when executed.  This is our first example of a truly recursive AST structure.  Repeat is defined as a subclass of Statement, and it contains a List of such Statement(s).  Thus, it is conceivable that a Repeat could contain another instance of Repeat, nested within its structure.  This is really the true power of the AST: the ability to represent a recursive grammar in a logical, high-level structure.

Of course, Interpreter also must be modified to support these new features; but because of our polymorphic Greeting design, we only need to worry about Repeat.  This node is easily handled by adding another pattern to our recursive match:

case Repeat(times, stmts) :: rest => {
  for (i <- 0 until times) {
    walkTree(stmts)
  }
 
  walkTree(rest)
}

Here we see the primary advantage to leaving the execution processing within Interpreter rather than polymorphically farming it out to the AST: direct access to the walkTree method.  Logically, each repeat statement contains a new Simpletalk program within itself.  Since we already have a method defined to interpret such programs, it only makes sense to use it!  The looping itself can be handled by a simple for-comprehension.  Following the loop, we deconstruct the list and move on to the next statement in the enclosing scope (which could be a loop itself).  This design is extremely flexible and capable of handling the fully recursive nature of our new language.
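The recursion is easiest to appreciate with a nested loop.  Here is a self-contained sketch using a cut-down AST (Print takes its text directly, and Space is omitted); the walk method is a hypothetical variant of walkTree which returns the output lines rather than printing them, which makes the result easy to inspect.

```scala
object NestedRepeat {
  sealed abstract class Statement
  case class Print(text: String) extends Statement
  case class Repeat(times: Int, stmts: List[Statement]) extends Statement

  // Same shape as walkTree, but collects the lines instead of printing.
  def walk(tree: List[Statement]): List[String] = tree match {
    case Print(text) :: rest => text :: walk(rest)

    // The loop body is itself a program, so we simply recurse on it.
    case Repeat(times, stmts) :: rest =>
      List.fill(times)(walk(stmts)).flatten ::: walk(rest)

    case Nil => Nil
  }

  def main(args: Array[String]): Unit = {
    val program = List(
      Repeat(2, List(
        Print("outer"),
        Repeat(3, List(Print("inner"))))),
      Print("done"))

    walk(program).foreach(println)
  }
}
```

Each pass through the outer loop yields one "outer" followed by three "inner" lines, and "done" appears exactly once at the end, for nine lines in total.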

The only other change we need to make is to add our new keywords to the lexer, so that they are not parsed as identifiers:

lexical.reserved += ("print", "space", "repeat", "next", "HELLO", "GOODBYE")

Iteration 3

It’s time to move on to something moderately advanced.  So far, we’ve stuck to easy modifications like new structures and extra keywords.  A more complicated task would be the addition of variables and scoping.  For example, we might want a syntax something like this:

let y = HELLO

space
print y
let x = 42
print x

space
repeat 10
  let y = GOODBYE
  print y
next

space
print y

space
print "Adios!"

And the result:

Hello, World!
42

Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!
Farewell, sweet petunia!

Hello, World!

Adios!

This is significantly more complicated than our previous iterations partially because it requires name resolution (and thus, some semantic analysis), but more importantly because it requires a generalization of expressions.  We already have expressions of a sort in the Greeting nodes.  These are not really general-purpose expressions as they do not resolve to a value which can be used in multiple contexts.  This was not previously a problem since we only had one context in which they could be resolved (print).  But now, we will need to resolve Greeting(s) for both print and let statements.

We will start with the AST this time.  We need to modify our Greeting superclass to allow for more complex resolutions than a static text value.  More than that, these resolutions no longer take place in isolation, but within a variable context (referencing environment).  This context will not be required for Literal, Hello or Goodbye expressions, but it will be essential to handle our new AST node: Variable.

sealed abstract class Statement
 
case class Print(expr: Expression) extends Statement
case class Space() extends Statement
 
case class Repeat(times: Int, stmts: List[Statement]) extends Statement
case class Let(id: String, expr: Expression) extends Statement
 
sealed abstract class Expression {
  def value(context: Context): String
}
 
case class Literal(text: String) extends Expression {
  override def value(context: Context) = text
}
 
case class Variable(id: String) extends Expression {
  override def value(context: Context) = {
    context.resolve(id) match {
      case Some(binding) => binding.expr.value(context)
      case None => throw new RuntimeException("Unknown identifier: " + id)
    }
  }
}
 
case class Hello() extends Expression {
  override def value(context: Context) = "Hello, World!"
}
 
case class Goodbye() extends Expression {
  override def value(context: Context) = "Farewell, sweet petunia!"
}

Notice that the Print and Let nodes both accept instances of Expression, rather than the old Greeting node.  This generalization is quite powerful, allowing variables to be assigned the result of other variables, literals or greetings.  Likewise, the print statement may also be used with variables, literals or greetings alike.

Actually, the real meat of the implementation is contained within the Context class (not shown).  This data structure will manage the gritty details of resolving variable names into let-bindings (which can then be resolved as expressions).  Additionally, Context must deal with all of the problems associated with nested scopes and name shadowing (remember that our motivating example shadows the definition of the y variable).

For the moment, we need not concern ourselves with reassignment (mutability).  Redefinition will be allowed, but simplicity in both the grammar and the interpreter calls for fully immutable constants, rather than true variables.  Additionally, this restriction allows our interpreter to easily make use of fully immutable data structures in its implementation.  Scala doesn’t really impose this as an implementation requirement, but it’s neat to be able to do.  Also, the pure-functional nature of the data structures provides greater assurance as to the correctness of the algorithm.

Here is the full implementation of both Context and the modified Interpreter:

class Interpreter(tree: List[Statement]) {
  def run() {
    walkTree(tree, EmptyContext)
  }
 
  private def walkTree(tree: List[Statement], context: Context) {
    tree match {
      case Print(expr) :: rest => {
        println(expr.value(context))
        walkTree(rest, context)
      }
 
      case Space() :: rest => {
        println()
        walkTree(rest, context)
      }
 
      case Repeat(times, stmts) :: rest => {
        for (i <- 0 until times) {
          walkTree(stmts, context.child)
        }
 
        walkTree(rest, context)
      }
 
      case (binding: Let) :: rest => walkTree(rest, context + binding)
 
      case Nil => ()
    }
  }
}
 
class Context