Skip to content

Scala Collections for the Easily Bored Part 2: One at a Time

28
Jul
2008

As I hinted previously, this series is intended to delve into Scala’s extensive collections API and the many ways in which it can make your life easier.  Probably the most important operations you could ever perform on collections are those which examine each element, one at a time.  After all, what’s a more common array idiom than looping over all values?

In that vein, this article starts by looking at foreach, the imperative programmer’s bread-and-butter when it comes to types like Array and List.  But rather than just stopping there, we also will look at more powerful, higher-order operations like fold, map and the ever-mysterious: flatMap.

Iterating

As I said above, looping over every item in a collection is probably the most heavily used operation in a programmer’s repertoire.  In fact, this pattern is so common in imperative languages that Java 5 saw the introduction of a special construct at the language level, just to make things a little easier.  For example:

String[] names = { "Daniel", "Chris", "Joseph" };
for (String name : names) {
    System.out.println(name);
}

This code should be old hat to anyone coming from a Java background.  Under the surface, this code is compiled into a while-loop with an iterator and an increment operation.  The code steps through the array, assigning each successive element to name.  Most statically typed languages have a construct similar to this.  For example, C# offers the foreach statement, which is almost precisely similar to Java’s enhanced for-loop, but with a slightly different syntax.  However, some languages (like Ruby) idiomatically eschew loops and rely instead on higher-order methods.  Translating the above into Ruby yields the following:

names = ['Daniel', 'Chris', 'Joseph']
names.each do |name|
  puts name
end

In this case, we aren’t using a loop of any kind, but rather creating a block (Ruby’s name for a closure) which takes a single parameter and passes it to the built-in puts method.  This block is then passed as an object to the each method of class Array, which calls the block once for each element in series.  Certainly, there is a language construct which encapsulates this, but using the each method directly is considered more “Ruby”.

The same approach is taken in Scala.  Rather than define a special construct for iteration, Scala simply provides the syntactic tools needed to construct higher-order methods like Ruby’s each.  Every collection in Scala’s library defines (or inherits) a foreach method (taking after its C# ancestry) which behaves precisely like Ruby’s each.  To show how, we will translate our example once more, this time into Scala:

val names = Array("Daniel", "Chris", "Joseph")
names.foreach { name =>
  println(name)
}

Here we define an anonymous method (Scala’s name for a closure) which takes a single parameter.  As in Ruby, this closure is passed to foreach and invoked for each array element.  In this way, foreach is a a so-called “higher-order” method, due to the fact that it accepts a parameter which is itself another method.  Scala’s implementation of this concept is actually a bit more general than Ruby’s, allowing us to shorten our example into the following:

val names = Array("Daniel", "Chris", "Joseph")
names.foreach(println)

This time, instead of creating an anonymous method to pass to foreach, we simply pass the println method itself.  The only criterion that foreach imposes on its parameter is that it is a method which accepts a single parameter of type String (the element type of the array).  Since we already have just such a method (println), there is no need to define another simply for encapsulation.

Unfortunately, despite its flexibility, foreach is not always the most syntactically concise way to iterate over a collection.  There are times that we just want to use a syntax which is similar to the for-loops available in other languages.  Well, fear not!  Scala’s for-comprehensions are more than capable of providing just such a syntax.  Consider the example of imperatively summing the values of a list:

val nums = List(1, 2, 3, 4, 5)
 
var sum = 0
for (n <- nums) {
  sum += n
}

At compile-time, the above is actually translated into an equivalent call to foreach, passing the body of the loop as the anonymous method.  This means that any class which correctly declares a foreach method may be used in a for-comprehension in this way, greatly reducing the syntactic overhead.

Folding

Looping is nice, but sometimes there are situations where it is necessary to somehow combine or examine every element in a collection, producing a single value as a result.  An example of this could be our previous example of summing a list.  Using foreach, we had to make use of a mutable variable (sum) and produce the result as a side effect.  This is fine for hybrid languages like Scala, but some languages actually lack mutable variables altogether.  In the previous post, I mentioned ML, which is a pure-functional language (almost).  Since pure-functional languages lack mutable state, the gods of computing needed to come up with some other way to accommodate this particular case.  The solution is called “folding”.

Folding a collection is literally the process of looking at each element in addition to a current accumulator and returning some value.  To make things more clear, let’s re-write our list summing example in a functional style:

val nums = List(1, 2, 3, 4, 5)
val sum = nums.foldLeft(0) { (total, n) =>
  total + n
}

It may seem a bit disguised, but this too is a loop, just like foreach.  For each element in the list (starting from the left), the foldLeft method will call the anonymous method we passed to it, providing both the current element, as well as the total accumulated from the previous call.  Since there was no previous call when the first element is processed, we must specify an initial value for total - in this case, 0.  Literally, the above can be flattened into the following:

val sum = 0 + 1 + 2 + 3 + 4 + 5

Of course, we would never want to hard-code a list in this fashion, but it serves as a sufficient illustration of the functionality.  I know from experience that when you first discover fold it’s difficult to see why anyone would want to use a construct so limited.  After all, doesn’t it just serve to obscure the true meaning of the code?  Well, take my word for it, fold is an almost indispensable tool…once you get to know it a little better.  Try keeping an eye out for times in your own code where a fold might be useful instead of a conventional loop.  You’ll be surprised how often it can be used to solve a problem, sometimes one not even intuitively related to accumulation.

There’s no special language-level syntax for fold, but Scala’s powerful operator overloading mechanism has allowed the designers of the collections API to create a special operator of rather dubious readability.  To illustrate, here’s our “summing a list” example once more:

val nums = List(1, 2, 3, 4, 5)
(0 /: nums) {_+_}

Yeah, I can’t read it either.  This example is semantically equivalent to the previous fold, but its meaning is a bit obfuscated by a) the bizarre right-associative operator, and b) a mysterious cameo by Scala’s ubiquitous underscore.  While I don’t have a problem using the underscore syntax in my own code, I don’t think that the fold operator improves anything other than number of characters.  I suppose it’s a matter of taste though.

Reduce

Fold has a closely related operation in Scala called “reduce” which can be extremely helpful in merging the elements of a sequence where leading or trailing values might be a problem.  Consider the ever-popular example of transforming a list of String(s) into a single, comma-delimited value:

val names = List("Daniel", "Chris", "Joseph")
val str = names.foldLeft("") { (acc, n) =>
  acc + ", " + n
}
 
println(str)

If you compile and run this code, you will actually arrive at a result which looks like the following:

, Daniel, Chris, Joseph

This is because folding a list requires a leading value, but this means that we have an extra separator running wild.  We could try a foldRight, but this would merely move the same problem to the tail of the string.  Interestingly enough, this problem of leading/trailing separators is hardly specific to folding.  I can’t tell you how many times I ran into this issue when constructing highly-imperative query synthesis algorithms for ActiveObjects.

The easiest way to solve this problem in Scala is to simply use a reduce, rather than a fold.  As a rule, any collection which defines foldLeft will also define reduceLeft (and the corresponding right methods).  Reduce distinguishes itself from fold in that it does not require an initial value to “prime the sequence” as it were.  Rather, it starts with the very first element in the sequence and moves on to the end.  Thus, the following code will produce the desired result for our names example:

val names = List("Daniel", "Chris", "Joseph")
val str = names.reduceLeft[String] { (acc, n) =>
  acc + ", " + n
}
 
println(str)

There are of course a few small problems with this approach.  Firstly, it is not as general as a fold.  Reduce is designed primarily for the iterate/accumulate pattern, whereas fold may be applied to many problems (such as searching a list).  Also, the reduce method will throw an exception if the target collection is empty.  Finally, Scala’s type inference isn’t quite clever enough to figure out what’s going on without the explicit [String] type parameter (since our result is of type String).  Still, even with all these shortcomings, reduce can be a very powerful tool in the right hands.

Mapping

We’ve seen how fold can be an extremely useful tool for applying a computation to each element in a collection and arriving at a single result, but what if we want to apply a method to every element in a collection in-place (as it were), creating a new collection of the same type with the modified elements?  Coming from an imperative background, this probably sounds a little abstract, but like fold, map can be extremely useful in many common scenarios.  Consider the example of transforming a list of String(s) into a list of Int(s):

val strs = List("1", "2", "3", "4", "5")
val nums = strs.map { s =>
  s.toInt
}
 
nums == List(1, 2, 3, 4, 5)   // true

The final expression in this snippet is just to illustrate what really happens to the list elements when map is called.  Literally, the map method walks through each element in the list, calls the provided method and then stores the result in the corresponding index of a new list.  (list is immutable, remember?)  If you think about it, this is very similar to looping with foreach except that at each iteration we produce a value which is saved for future use.

Another common application of this technique might be to cast an entire array from one type to another.  I often make use of XMLRPC, which has the unfortunate property of stripping all type information from its composite types.  Thus, I often find myself writing code like this:

public void rpcMethod(Object[] params) {
    String[] strParams = new String[params.length];
    for (int i = 0; i < params.length; i++) {
        strParams[i] = (String) params[i];
    }
}

It’s a lot of trouble to go through, but I really don’t know any better way.  We can’t just cast the array to String[], since the array itself is not of type String[], only its elements match that type.  Java doesn’t support higher-order operations such as map, but fortunately Scala does.  We can translate the above into a functional style and gain tremendously in both readability and conciseness:

def rpcMethod(params: Array[Object]) {
  val strParams = params.map { _.asInstanceOf[String] }
}

For the sake of brevity, you’ll notice that I made use of the underscore syntax as a placeholder for the anonymous method parameter.  This syntax works remarkably well for short operations like the above, where all we need to do is take the input value and cast it to a new type.

As it turns out, mapping over a collection is a phenomenally common operation, perhaps even more so than fold.  For that reason, the creators of Scala decided that it was worth adding a special syntax sugar built into the powerful for-comprehension mechanism.  With a little bit of tweaking, we can transform our casting example into an arguably more readable form:

def rpcMethod(params: Array[Object]) {
  val strParams = for (p <- params) yield p.asInstanceOf[String]
}

At compile-time, these two forms are literally equivalent.  In some ways it is a matter of taste as to which is better.  I personally tend to favor directly calling map for simple, non-combinatorial operations like this, but to each his own.

Binding

Actually, the name “bind” comes from Haskell.  Scala’s term for this operation is “flatMap” because the operation may be viewed as a combination of the map and flatten methods.  Of all of the techniques discussed so far, this one probably has the richest theoretical implications.  Coming straight from the menacing jungles of category theory and the perplexing wasteland of monads, flatMap is both intriguing and apparently useless (at first glance anyway).

Like map, flatMap walks through every element in a collection and applies a given function, saving the value for later use.  However, unlike map, flatMap expects the return type of the specified function to be the same as the enclosing collection with an optionally different type parameter.  If we look at this in terms of our number-converting example from previously, this means that our anonymous method must not return a value of type Int, but rather of type List[Int].  Once flatMap has all of the resultant List[Int] values, it flattens them into a single list containing all of the elements from each of the inner lists.

Ok, that was utterly un-helpful.  Maybe the method signature would be more illuminating:

class List[A] {   // don't try this at home
  def flatMap[B](f: (A)=>List[B]): List[B]
}

Other than forcing order of evaluation, I can’t personally think of too many common cases where this sort of operation is useful.  However, one contrived example does spring to mind.  Consider once more the example of converting a list of String(s) into a list of Int(s), but this time assume that some of the String elements might not nicely convert into integer values:

val strs = List("1", "two", "3", "four", "five")
val nums = strs.flatMap { s =>
  try {
    List(s.toInt)
  } catch {
    case _ => Nil
  }
}
 
nums == List(1, 3)    // true

Recall that in Scala, everything is an expression - including try/catch blocks - therefore, everything has a value.  This code literally walks through the entire list and tries to convert each element into an integer and wrap the result in a List.  If the conversion fails (for whatever reason), an empty list is returned (Nil).  Because we return an empty list for those elements which cannot be converted, flatMap literally resolves those results out of existence, leading to a List which only contains two Int(s).  For the monadically uninclined among us, this is precisely the reason why Nil is referred to as the “zero” of the List monad.  However, that’s a topic for an entirely different series

Conclusion

Ok, so this article was a bit longer than I really wanted to run, but there’s a lot of material to cover!  Scala’s collections framework shows how even operations steeped in mountains of theory can still prove useful in solving common problems.  Now, every time I use collections in Java (or even Ruby), I find myself reaching for many of these same methods, only to find them either unavailable or less powerful than I would like.  Scala provides a welcome refuge for all those of us who desire more powerful collection types in our daily programming.

Be with us next time for filter, forall, exists and more!

Scala Collections for the Easily Bored Part 1: A Tale of Two Flavors

21
Jul
2008

One of the most obvious things to a Java developer first coming into Scala-land is the radically different Collections API included as part of the standard library.  For the most part, we use the same frameworks and APIs in Scala as are available in Java.  This is natural, thanks to the extremely tight integration between the two languages.  So why is this one area such a startling departure from Scala’s heritage?

The answer has to do with what the Scala language is syntactically capable of handling.  Scala isn’t just an object-oriented language, it is also highly functional.  It is only natural that such an integral part of the core libraries would reflect this fact.  Unfortunately, most developers fail to take full advantage of the power offered by the collections API.  Despite the available power, most code written using Scala’s collections tends to look a lot like Java in disguise.

I had actually planned on addressing this topic in a single article.  However, Scala’s collections are so vast and powerful (sounds like one of Roddenberry’s alien consciousness beings) that it really would overrun the limits of conventional blogging to attempt to cover it all in a single post.  Despite the fact that I’ve been creating mammoth anthologies of late, I think it’s probably better to break it into bite-sized chunks.  First up: the confusing dual nature of Scala’s collections API!

A Tale of Two Flavors

The very first thing developers notice when looking at Scala’s collections is a (seemingly) odd redundancy in the specification.  Looking under the scala.collection package, we see not one, but three separate sub-packages, each containing what seem to be reimplementations of the same types.  For example, consider the following three traits:

I don’t know about you, but this confused the heck out of me the first time I really dug into Scala’s standard API.  Actually, it gets even worse when you discover that there is also a trait (and companion object), scala.collection.Map, which is actually a super-trait of the three listed above.  Seems like Dr. Odersky discovered the magic of separated namespaces and reacted like a two-year-old on espresso.

As it turns out, there’s a very logical reason for having these separated and seemingly-redundant collections packages.  Of the three, one of them can be discounted immediately as uninteresting.  The jcl package contains collections, but they are merely wrappers for the corresponding Java collections.  This allows more efficient transmutation between the Java collections API and Scala.  It is almost never necessary to use this package directly, since a number of implicit conversions are built into Scala to make the process essentially transparent.

Of course, this still leaves the immutable and mutable packages.  This distinction traces back to some of Scala’s functional roots.  As you are likely aware, Scala supports both mutable and immutable variables, as denoted by the var and val keywords, respectively.  While there are some significant differences at compile-time, conceptually, the only distinction between these two types is the former allows reassignment whereas the latter does not.  Mutable variables are a common - and indeed, essential - feature in imperative languages (Java, C++, Ruby, etc).  For example, here’s how we would sum an array of integers in Java:

int[] numbers = ...
 
int sum = 0;
for (int n : numbers) {
    sum += n;   // reassign sum to accumulate n
}

In this case, sum is a mutable variable which accumulates the total value of all numbers in the array, numbers.  Theoretical disputes aside, this style of programming is simply impossible in certain languages.  For example, SML provides no mechanism for declaring mutable variables.  So if we want to sum the values in a list of ints, we have to do it in some other way (code provided for the curious):

fun sumList ls = foldl (op +) 0 ls

In Scala, both of these techniques are available to us.  However, despite providing for mutable state, Scala does encourage developers to avoid it.  The reasoning behind this is that mutable state has a tendency to make code more difficult to reason about, making testing much harder.  Also, as any experienced developer will tell you, mutable state kills concurrency.

Not only does Scala encourage the use of immutable variables, it also encourages the use of immutable collections.  This concept may seem a little bizarre to those of you coming from an imperative background (I know it did to me), but it actually works.  While a map container which cannot be modified probably seems a little useless, it actually provides a startling degree of freedom from concerns like concurrent locking and unintended side-effects.  With immutable collections, you are able to manipulate objects with perfect assuredness that no method call will “accidentally” alter its contents.  How many times have you cloned a collection prior to returning it?  Or how often have you dug through someone else’s code just to assure yourself that it is safe to call a given method passing an instance of List?  Immutable data structures completely solve that problem.

Naturally, immutable collections require very different patterns and idioms than those which are mutable.  My ML code from above illustrates this to a very small degree, using a fold to traverse an immutable list.  As a general rule, immutable data structures are only useful when being passed around via method calls.  A common pattern is to build up a data structure recursively, creating a new instance with one more element at each invocation:

def toSet[T](list: List[T]) = {
  def traverse(list: List[T])(set: Set[T]): Set[T] = list match {
    case hd :: tail => traverse(tail)(set + hd)   // create a new Set, adding hd
    case Nil => set
  }
 
  traverse(list)(Set[T]())
}
 
val names = List("Daniel", "Chris", "Joseph", "Renee")
val nameSet = toSet(names)

At each recursive invocation of traverse, a new Set is created, based on the contents of the old one, set, but with one additional member, hd.  At the same time, we are deconstructing an immutable List, list, selecting its first element and whatever remains at each step.  Whenever you work with immutable data structures, you will see a lot of code which looks like this.

Of course, the natural question which comes to mind is, what about performance?  If each invocation actually creates a brand new Set for every recursive call, doesn’t that require a lot of inefficient object copying and heap operations?  Well, as it turns out, this isn’t really the case.  Yes, a new instance must be created at each turn, which is is a comparatively expensive operation on the JVM, but almost nothing is being copied.  All of Scala’s immutable data structures have a property known as persistence, which means that you don’t copy the data out of the old container when creating a new, you just have the new reference the old and treat all of its contents as if they were its own.  A linked list is a good example of this, since each node of a list contains exactly one element, as well as a reference to another node.  If we think of each node as a representative of a list starting with itself and traversing to the end, then list suddenly becomes a fully persistent data structure (since the new list contains a sub-list in its entirety and additive operations require no data copying).  Rich Hickey, the creator of Clojure (a Lisp dialect running on the JVM), has an excellent presentation which explains some of the hows and whys behind this technique (as well as some other interesting topics).  Chapter 19 of Programming in Scala (Odersky, Spoon & Venners) also has a good example of a persistent immutable queue.

Conclusion

I happen to like immutable data, so most of this series uses the scala.collection.immutable package.  However, there are certainly situations where mutable data structures are the only way to go, either because of system requirements or for performance reasons.  Fortunately, Scala’s mutable collections have an almost identical interface to its immutable collections.  Thus, most of the in formation presented here is applicable to both branches of the library.

Now that we have laid a basic foundation regarding the fundamentals of Scala’s collections framework, we can move onto more interesting things.  The next installment will deal with fold and map, the bread and butter of every collection.

Bencode Stream Parsing in Java

15
Jul
2008

It’s surprising how universal XML has become.  It doesn’t seem to matter what the problem, XML is the solution.  For example, consider a simple client/server architecture where the communication protocol must transmit some sort of structured data.  Nine developers out of ten will form the basis of the protocol around XML.  If it’s a lot of data to be transferred, then they will compress the XML using Java’s stream compression libraries.  If there’s binary data to be transmitted, it will either be stored as CDATA within the XML or as files within the same compressed archive.  Very few developers will actually stop and consider alternative solutions.

One such “alternative solution” is bencode (pronounced “bee-encode”).  Similar to formats like XML and JSON, bencode defines a series of constructs which may be used to encode arbitrarily complex data.  However, unlike XML, the design focus of the format was not to produce verbose, human-readable documents, but rather to encode data in the most concise manner possible.  To that end, the core bencode specification only includes four data types, two simple and two composite structures.  These types are defined with an almost complete absence of meta, requiring very little “structure” to clutter the data stream.

Unfortunately, outside of applications like BitTorrent, this elegant binary format has seen remarkably little adoption.  Because of this state of affairs, it can be extremely difficult to find libraries to actually process bencode data.  Not too long ago, I ran into a production use-case which required both parsing and generation of bencode-formatted files.  I considered digging into the source code for Vuze (nee “Azureus”), but a) it seemed like a lot of boring, nearly-wasted effort, and b) I strongly suspect that their bencode parser and generator are extremely space inefficient, since the data sources which they deal with are remarkably small.

The second hang-up was really a more significant motivator than the first, due to the fact that I knew I would be dealing with bencode streams potentially gigabytes in size.  So, rather than fruitlessly dig through someone else’s code, I decided to put all of this formal parser theory to work and roll my own library.  Unless you’re already familiar with bencode, I suggest you read the Wikipedia article to get a feel for the format, otherwise some of what I will be talking about will make no sense at all.  :-)

The first thing I needed to do was build the generation half of the library.  I decided that it would be easier if I avoided trying to use the same backend framework classes with both the generator and the parser.  For example, there are actually two classes in the framework which contain the logic for handling an integer: IntegerValue and IntegerType.  The former is for use in the parser, while the latter is for use in the generator.  This separation of logic may seem a little strange, but it actually simplifies things tremendously.

Remember my primary requirement: extremely efficient implementation of both generator and parser, especially with respect to space.  If I attempted to use the same classes to represent data for both the parser and the generator, then the parser would be forced to read the entire stream into some sort of in-memory representation (think about it; it’s actually true).  Obviously, this is unacceptable for streams that are gigabytes in size, so the traditional “good design” from an object-oriented standpoint was out.

Stream Generation

Since I needed the functionality of bencode stream generation before I needed parsing, I started with that aspect of the framework.  Here again, the most obvious “object-oriented” approach would have been the wrong one.  When we think of generating output in a structured format programmatically, we naturally imagine a DOM-like tree representation (preferably framework-agnostic) which is then walked by the framework to produce the output.  The major disadvantage to this approach is that it requires paging everything into memory.  This works for smaller applications or situations where the data is already in memory, but for my particular use-case, it would have been disastrous.

The only way to avoid paging everything into memory for stream generation is to structure the API so that the data is “pulled” by the generator, rather than “pushed” to it in tree-form.  In other words, the data itself has to be lazy-loaded, using callbacks to grab the data as-needed and hold it in memory only as long as is absolutely necessary.  In a functional language, this would be done with closures (or even normal data types in a pure-functional language).  However, as we all know, Java does not support such time-saving features.  The only recourse is to use abstract classes and interfaces which can be overridden in anonymous inner-classes as well as top-level classes as necessary.

image

After a bit of experimentation, the finalized hierarchy looks something like this.  Logically, every type must be able to query its abstract method for data of a certain Java type (long for IntegerType, InputStream for StringType, etc), convert this data into bencode with the appropriate meta, and then write the result to a given OutputStream.  Also following our nose, we see the semantic differences between composite and primitive types are really quite limited, especially if we simplify everything to a black box “get data / write encoding” methodology.  In fact, the only thing that CompositeType actually does is enforce the prefix/suffix encoding of every composite type.  Since this is in compliance with the bencode specification, we are safe in extracting this functionality into a superclass.

The more interesting distinction is between so-called “variant” and “invariant” types.  This is where you should begin to notice that I have over-engineered this library to some degree.  If I was just trying to create a pure bencode generator, then I could have skipped InvariantPrimitiveType and VariantPrimitiveType and just let IntegerType and StringType extend PrimitiveType directly.  This comes back to my initial requirements.

Priority one was to create a framework which was blazingly fast, but priority two was to ensure that it was extensible at the type level.  For the particular application I was interested in, I required more than just the core bencode types.  Also on the agenda were proper UTF-8 strings, dates, and support for null.  To accommodate all of this without too much code duplication, I knew I would have to extract a lot of the functionality into generic superclasses.  Hence my somewhat incorrect use of the terms “variant” and “invariant” to describe the difference between the integer type - which is prefix/suffix delimited - and the string type - which defines a length as its prefix and has no closing suffix.

Anyway, back to the problem at hand.  In addition to the CompositeType and PrimitiveType, you should also notice EntryType.  This “extra” type exists to handle the fact that bencode dictionaries are extremely weird and sit rather outside the “common functionality” umbrella of the format in general.  For one thing, the specification requires that dictionary entries be sorted by key, obviously implying some sort of Comparable relation.  Moreover, these keys must be themselves strings, but StringType isn’t comparable because its writeValue(OutputStream) method doesn’t return the data in question, but merely writes it to a given OutputStream.  Are we starting to see the problems with space-efficient implementations?

Enough babble though, let’s see some code!  Here’s how we might encode some very simple data using the generator framework:

public class GeneratorTest {
    public static void main(String[] args) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        final byte[] picture = new byte[0];        // presumably something interesting
 
        DictionaryType root = new DictionaryType() {
            @Override
            protected void populate(SortedSet<EntryType<?>> entries) {
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("name"), 
                        new LiteralStringType("Arthur Dent")));
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("number"), 
                        new IntegerType(42)));
 
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("picture"), 
                        new StringType() {
 
                    @Override
                    protected long getLength() {
                        return picture.length;
                    }
 
                    @Override
                    protected void writeValue(OutputStream os) throws IOException {
                        os.write(picture);
                    }
                }));
 
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("planets"), 
                        new ListType() {
 
                    @Override
                    protected void populate(ListTypeStream list) throws IOException {
                        list.add(new LiteralStringType("Earth"));
                        list.add(new LiteralStringType("Somewhere else"));
                        list.add(new LiteralStringType("Old Earth"));
                    }
                }));
            }
        };
 
        try {
            root.write(os);
        } catch (IOException e) {
            e.printStackTrace();
        }
 
        System.out.println(new String(os.toByteArray()));
    }
 
    private static class LiteralStringType extends StringType 
            implements Comparable<LiteralStringType> {
        private final String value;
 
        public LiteralStringType(String value) {
            this.value = value;
        }
 
        @Override
        protected long getLength() {
            return value.length();
        }
 
        @Override
        protected void writeValue(OutputStream os) throws IOException {
            os.write(value.getBytes("US-ASCII"));
        }
 
        public int compareTo(LiteralStringType o) {
            return o.value.compareTo(value);
        }
    }
}

It’s hard to imagine why some people claim that Java is a verbose language…

The API may seem a little clumsy, but most of that is caused by the conniptions required to make the generator lazily pull the data, rather than paging it all into memory ahead of time.  Throwing that aside, the rest of the verbosity seems to come from the need for LiteralStringType, rather than just having a StringType which could handle this for us.  The reason for this extra headache is shown in the population of the “picture” field, which presumably may contain several megabytes worth of data from some external source such as a file or database (in this case of course, it doesn’t contain anything, but that’s besides the point).

The result of the above is as follows:

d4:name11:Arthur Dent6:numberi42e7:picture0:7:planetsl5:Earth14:Somewhere else9:Old Earthee

Or, with a little formatting to make it more palatable:

d
  4:name
  11:Arthur Dent

  6:number
  i42e

  7:picture
  0:

  7:planets
  l
    5:Earth
    14:Somewhere else
    9:Old Earth
  e
e

Technically, this is no longer valid bencode, but it is much easier to read this way.

The Parser

With all this bustle surrounding the generator, it’s easy to forget about the inverse process: parsing.  As it turns out, this is both easier and far less elegant than the solution for the generator (I know, it’s a sad state of affairs when the above is considered “elegant”).  Here again, there was a need for the parser to be extremely efficient, especially in terms of memory.  Thus, the logical approach of simply parsing the stream into an in-memory tree doesn’t really work.  Instead, the parser must be a so-called “pull parser”, which only parses each token upon request.  The parser only does exactly what work you ask of it, nothing more.

My initial designs for the parser attempted to follow the example set by the generator: each value type self-contained, responsible for parsing its own format.  As it turns out, this can be difficult to accomplish.  I could have expanded slightly on the parser combinator concept, but monads are very clumsy to achieve in Java, which led me to rule out that option.  In the end, I took a middle ground.

Click for full size

As before, a common superinterface sits above the entire representative hierarchy.  To understand this hierarchy a little better, perhaps it would be helpful to look at the full source for Value:

public interface Value<T> {
    public T resolve() throws IOException;
    public boolean isResolved();
}

The resolve() method is really the core of the entire parser.  The concept is that each value will be able to consume the bytes necessary to determine its own value, which is converted and returned.  This is extremely convenient because it enables VariantValue(s) (such as string) to carry the logic for parsing to a specific length, rather than the conventional e terminator.  In order to avoid clogging up memory, the return value of resolve() should not be memoized (though, there is nothing in the framework to prevent it).  Conventionally, values which are already resolved should throw an exception if they are resolved a second time.  This prevents the framework from holding onto values which are no longer needed.

You will also notice that CompositeValue not only inherits from Value, but also from the JDK interface, Iterable.  Logically, a composite value is a linear collection of values, consumed one at a time.  To me, that sounds a lot like a unidirectional iterator.  We can, of course, resolve the entire composite at once, mindlessly consuming all of its values, but since all of the values are lost once consumed, the only purpose for such an action would be if we know that we don’t care about a particular composite and we just want to rapidly skip to the next value in the stream.

Returning to primitive values, the resolve() method for IntegerValue is worthy of note, not so much for its uniqueness, but because it is very similar to the parsing technique used in all the other values:

public Long resolve() throws IOException {
    if (resolved) {
        throw new IOException("Value already resolved");
    }
    resolved = true;
 
    boolean negative = false;
    long value = 0;
 
    int b = 0;
    while ((b = is.read()) >= 0) {
        int digit = b - '0';
 
        if (digit < 0 || digit > 9) {
            if (b == '-') {
                negative = true;
            } else if (b == 'e') {
                break;
            } else {
                throw new IOException("Unexpected character in integer value: " 
                    + Character.forDigit(b, 10));
            }
        } else {
            value = (value * 10) + digit;
        }
    }
 
    if (negative) {
        value *= -1;
    }
 
    return value;
}

The i prefix itself is consumed before control flow even enters this method.  This is because the prefix is required to determine the appropriate value implementation to use.  Specifically, the logic to perform this determination is contained within the Parser class, which maintains a map of Value(s) and their associated prefixes.  String values have special logic associated with them, as they do not have a prefix.

As with most hand-coded parsers, this one operates on the principle of “eat until it hurts”.  We start out by assuming that the integer value extends to the end of the stream, then we set about to find a premature end to the integer, at which point we break out and call it a day.  Since we are moving from left to right through a base-10 integer, we must multiply the current accumulator by 10 prior to adding the new digit. 

Actually, the real heart of the parser framework is CompositeValue.  This class is inherited by Parser to define a special value encompassing the stream itself (which is viewed as a composite value with no delimiters and only a single child).  This unification allows us to keep the code for parsing a composite stream in a single location.  This implementation is a little less concise than the code for parsing an integer, but it follows the same pattern and is fairly instructive:

protected final Value<?> parse() throws IOException {
    if (resolved) {
        throw new IOException("Composite value already resolved");
    }
 
    if (previous != null) {
        if (!previous.isResolved()) {
            previous.resolve();        // ensure we're at the right spot in the stream
        }
    }
 
    byte b = -1;
    if (readAhead instanceof Some) {
        b = readAhead.value();
        readAhead = new None<Byte>();
    } else {
        b = read();
    }
 
    if (b >= 0) {
        Class<? extends Value<?>> valueType = parser.getValueType(b);
 
        if (valueType != null) {
            return previous = Parser.createValue(valueType, parser, is);
        } else if (b > '0' && b <= '9') {
            return previous = readString(b - '0');
        } else if (b == ' ' || b == '\n' || b == '\r' || b == '\t') {
            return parse();        // loop state
        } else {
            throw new IOException("Unexpected character in the parse stream: " 
                + Character.forDigit(b, 10));
        }
    }
 
    throw new IOException("Unexpected end of stream in composite value");
}
 
private final StringValue readString(long length) throws IOException {
    int i = is.read();
 
    if (i >= 0) {
        byte b = (byte) i;
 
        if (b == ':') {
            return Parser.createValue(StringValue.class, parser, 
                new SubStream(is, length));
        } else if (b >= '0' && b <= '9') {
            return readString((length * 10) + b - '0');
        } else {
            throw new IOException("Unexpected character in string value: " 
                + Character.forDigit(i, 10));
        }
    }
 
    throw new IOException("Unexpected end of stream in string value");
}

It seems a bit imposing, but really this code is more of the same logic we saw previously when dealing with integers.  The only value type which really gives us trouble here is string.  We can’t simply treat it like the others because it has no prefix.  For this reason, we must assume that any unbound integer is an inclusive prefix for a string.  In most parser implementations, this would require backtracking, but because we are doing this by hand, we can condense the backtrack into an inherited parameter (borrowing terminology from attribute grammars), avoiding the performance hit.

There’s one final bit of weirdness which deserves attention before we bail on this small epic: dictionary values.  Intuitively, a dictionary value should be parsed into a Java Map, or some sort of associative data structure.  Unfortunately, a map is by definition a random access data structure.  Since we are dealing with a sequential bencode stream, the only recourse to satisfy this property would be to page the entire dictionary into memory.  This of course violates one of the primary requirements which is to avoid using more memory than necessary.

The solution I eventually chose to this problem was to limit dictionary access to sequential, which translates into alphabetical given the nature of bencode dictionaries.  Thus, a dictionary can be parsed in the same way as a list, where each element is a sequential key and value, jointly represented by EntryValue.  To make usage patterns slightly easier, EntryValue memoizes the key and value.  Due to the fact that both of these objects are themselves Value(s), this does not lead to inadvertent memory bloat.

Conclusion

Hopefully the parser and generator presented here will be of some utility in situations where you have to parse large volumes of bencoded data.  The API is (admittedly) bizarre and difficult to deal with, but the performance results are difficult to deny.  This framework is currently deployed in production, where benchmarks have shown that it imposes little-to-no runtime overhead, and practically zero memory overhead (despite the sizeable amounts of data being processed).

For convenience, I actually created a Google Code project for this framework so as to facilitate its development internally to the project I was working on.  The end result of this is unlike most of my experiments, there is actually a proper SVN from which the source may be obtained!  A packaged JAR may be obtained from the downloads section.

JRuby Interop DSL in Scala

24
Mar
2008

JRuby is an amazing bit of programming.  It has managed to rise from its humble beginnings as a hobby project on SourceForge to the most viable third-party Ruby implementation currently available.  As far as I am aware, JRuby is the only Ruby implementation other than MRI which is capable of running an unmodified Rails application.  But JRuby’s innovation is not just limited to a rock-solid Ruby interpreter, it also provides tight integration between Java and Ruby.

There’s a lot of material out there on how to replace Java with Ruby “glue code” in your application.  The so-called “polyglot programming” technique states that we should embrace multiplicity of language in our applications.  Java may be very suitable for the core business logic of the application, but for actually driving the frontend UI, we may want to use something more expressive (like Ruby).  JRuby provides some powerful constructs which allow access to Java classes from within any Ruby application.  For example:

require 'java'
 
JFrame = javax.swing.JFrame
JButton = javax.swing.JButton
JLabel = javax.swing.JLabel
 
BorderLayout = java.awt.BorderLayout
 
class MainWindow < JFrame
  def initialize
    super 'My Test Window'
 
    setSize(300, 200)
    setDefaultCloseOperation EXIT_ON_CLOSE
 
    label = JLabel.new('You pushed the button', JLabel::CENTER)
      label.visible = false
    add label
 
    button = JButton.new 'Push Me'
    button.add_action_listener do
      label.visible = true
    end
    add(button, BorderLayout::SOUTH)
  end
end
 
window = MainWindow.new
window.visible = true

sshot-1  sshot-2 

Not a terribly complex example, but it illustrates some of the major advantages of JRuby.  Notice how clean and concise this code is.  It wouldn’t have been much longer had I done this using Java, but it would certainly have been less readable.  Ruby is absolutely perfect for this sort of use case (driving a UI).

As I said though, there are a myriad of examples showing this sort of thing.  As such, it’s not a very interesting topic for a posting.  What the masses have failed to cover, however, is how to accomplish the opposite: calling from Java into Ruby.

Likely the reason this topic has received less attention is because Java is the language will the veritable zoo of libraries and frameworks.  The amount of effort and research that has been put into Java simply dwarfs the comparative immaturity of the Ruby offerings.  Given the disparity, why would you even want to call into Ruby from Java?  This conclusion seems logical until one remembers that almost any application which uses Ruby for the frontend must actually pass flow control to Ruby at some point.  This means calling some sort of Ruby code.

The Java Way

There is some information available on the JRuby Wiki.  The wiki article really should include the caveat that “some experimentation may be required.”  Sufficient information is available, but it is neither intuitive nor convenient.  From Java, the syntax for executing an arbitrary Ruby statement looks like this:

ScriptEngineManager m = new ScriptEngineManager();
ScriptEngine rubyEngine = m.getEngineByName("jruby");
ScriptContext context = engine.getContext();
 
context.setAttribute("label", new Integer(4), ScriptContext.ENGINE_SCOPE);
 
try {
    rubyEngine.eval("puts 2 + $label", context);
} catch (ScriptException e) {
    e.printStackTrace();
}

It’s a typical Java API: over-bloated, over-designed and over-generic.  What would be really nice is to have a syntax for accessing Ruby objects that is as seamless as accessing Java from Ruby.  I want to be able to call Ruby methods and use Ruby classes with the same ease that I can use Java methods and classes.  In short, I want an internal DSL for Ruby.

Unfortunately, Java is a bit constrained in this regard.  Java’s syntax is extremely rigid and does not lend itself well to DSL construction.  It’s certainly possible, but the result is usually less than satisfactory.  We could certainly construct an API around the the Java Scripting API (JSR-233) which provides more high-level access (such as direct method calls and object wrappers), but it would be clunky and only a marginal improvement over the original.

The good news is there’s another language tightly integrated with Java that has a far more flexible syntax.  Rather than building our JSR-233 wrapper in Java, we can avail ourselves of Scala’s power and flexibility, hopefully arriving at a DSL which approaches native “feel” in its syntax.

The Scala Way

Since we’re attempting to construct a tightly-integrated API for language calls, the most effective route would be to apply techniques already discussed in the context of DSL design.  As always, we start with the syntax and allow it to drive the implementation:

// syntax.scala
import com.codecommit.scalaruby._
 
object Main extends Application with JRuby {
  require("test")
 
  associate('Person)(new Person(""))
 
  println("Received from multiply: " + 'multiply(123, 23))
  println("Functional test: " + funcTest('test_string))
 
  val me = new Person("Daniel Spiewak")
  println("Name1: " + me.name)
  println("Name2: " + (me->'name)())
 
  me.name = "Daniel"
  println("New Name: " + me.name)
 
  println("Person#toString(): " + me)
 
  val otherPerson = 'create_person().asInstanceOf[AnyRef]
  println("create_person type: " + otherPerson.getClass())
  println("create_person value: " + otherPerson.send[String]("name")())
 
  eval("puts 'Ruby integration is amazing'")
 
  def funcTest(fun:(Any*)=>String) = fun()
}
 
class Person(name:String) extends RubyClass('Person, name) {
  def name = send[String]("name")()
  def name_=(n:String) = 'name = n
}

And the associated Ruby code:

# test.rb
class Person
  attr_reader :name
  attr_writer :name
 
  def initialize(name)
    @name = name
  end
 
  def to_s
    "Person: {name=#{name}}"
  end
end
 
def test_string
  'Daniel Spiewak'
end
 
def multiply(a, b)
  a * b
end
 
def create_person
  Person.new 'Test Person'
end

Obviously we’re going to need some heavy implicit type conversions.  The important thing to note is that we don’t see any residue of the Java Scripting API, it’s all been encapsulated by our DSL.  We’ve taken an API which is oriented around single-call, low-level invocations and created a high-level wrapper framework which allows method calls, instantiation and even some form of type-checking.

Starting from the top, we see a call which should be familiar to Rubyists, the require statement.  In our framework, this method call is just a bit of syntactic sugar around a call to eval(String).  This semantics are basically the same as within Ruby directly, with the exception of how Ruby source files are resolved.  Any script file on the CLASSPATH is fair game, in addition to the normal Ruby locations.  This allows us to easily embed Ruby scripts within application JARs, libraries and other Java distributables.

Moving down a bit further, we find a somewhat mysterious call to the curried associate(Symbol)(RubyObject) method.  The purpose of this invocation will become more apparent later on.  Suffice it to say that this step is necessary to allow Scala class wrappers around existing Ruby classes.

On the next line of interest, we see for the first time how the framework allows for seamless Ruby method invocation.  Unlike Ruby, Scala doesn’t allow us to simply handle calls to non-existent methods.  Because of this limitation, we have to be a bit more clever in how we structure the syntax.  In this case, we use Scala symbols to represent the method.  There doesn’t seem to be a terribly good explanation of symbols in Scala, but there’s plenty of information regarding how they work in Ruby.  Since the concepts are virtually identical, techniques are cross-applicable.

The key to the whole “symbols as methods” idea is implicit type conversion.  The JRuby trait inherits a set of conversions which look something like this:

implicit def sym2Method[R](sym:Symbol):(Any*)=>R = send[R](sym2string(sym))
implicit def sym2MethodAssign[R](sym:Symbol) = new SpecialMethodAssign[R](sym)
 
private[scalaruby] class SpecialMethodAssign[R](sym:Symbol) {
  def intern_=(param:Any) = new RubyMethod[R](str2sym(sym2string(sym) + "="))(param)
}

Though we haven’t looked at it yet, it is possible to infer the purpose of the send(String) method.  It’s function is to prepare a call to a Ruby method without actually invoking it.  This distinction allows us to pass Ruby methods around as method parameters, just like standard Scala methods.  The method returned is actually an instance of class RubyMethod[R] (where R is the return type).  Scala allows classes to extend structural types like methods, allowing us to redefine the method invocation semantics for wrapped Ruby calls.

class RubyMethod[R](method:Symbol) extends ((Any*)=>R) {
  import JRuby.engine
 
  override def apply(params:Any*) = call(params.toArray)
 
  private[scalaruby] def call(params:Array[Any]):R = {
    val context = engine.getContext()
    val plist = new Array[String](params.length)
 
    for (i <- 0 until params.length) {
      plist(i) = "res" + i
      context.setAttribute(plist(i), JRuby.resolveValue(params(i)), ScriptContext.ENGINE_SCOPE)
      plist(i) = "$" + plist(i)
    }
 
    evaluate(() => if (plist.length > 0) {
      sym2string(method) + "(" + plist.reduceLeft[String](_ + ", " + _) + ")"
    } else {
      sym2string(method) + "()"
    })
  }
 
  protected def evaluate(invoke:()=>String):R = {
    val toRun = invoke()
    Logger.getLogger("com.codecommit.scalaruby").info(toRun)
 
    JRuby.handleExcept(JRuby.wrapValue[R](engine, engine.eval(toRun, engine.getContext())))
  }
}

The gist of this code is simply to assign every parameter value to an attribute in the Ruby runtime.  Attributes of ENGINE_SCOPE (as defined by JSR-233) are represented as global variables within Ruby.  These variables are named sequentially starting from zero.  (e.g. $res0, $res1, …)  As you can imagine, this technique tends to be a bit of a concurrency killer.  To keep things simple, I decided to completely ignore the issues associated with asynchronous execution.  It is certainly possible to adapt the framework to function in a multi-threaded environment, but I didn’t bother to do it.  (one of the perks of blogging is a license to extreme laziness)

Once these parameters are assigned, the method call is evaluated within the context of the runtime.  This is done by literally generating the corresponding Ruby code (done in the anonymous method) and then wrapping the return value in an instance of RubyObject (if necessary).  Note that the send(String) method does not actually kick-start this invocation process at all.  Rather, it creates an instance of RubyMethod[R] which corresponds to the method name.  This class extends (Any*)=>R, so it may be used in the normal “method fashion” - by appending parentheses which enclose parameters (if any).

Supporting Cast

At this point, it’s worth taking a moment to examine the specifics of the framework class hierarchy.  A number of classes exist to wrap around Ruby objects and methods.  We’ve already seen a few of them (RubyMethod[R] and RubyObject), but it’s worth going into more detail as to their purpose and relation to one another. 

Note that these class names often conflict with existing classes in the JRuby implementation.  This odd coincidence is precipitated by the fact that the framework seems to deal with a lot of the same concepts as the JRuby runtime (go figure).  Rather than obfuscating my class naming to avoid conflict, I just assume that you will either make use of the enhanced Scala import feature (as I have in the implementation), or just avoid using the JRuby internal classes.

image

  • RubyObject - The root of the object hierarchy.  This abstract class is designed to encapsulate the core functionality of the generic object (roughly: send, -> and eval) as well as containing all of the implicit type conversions.  Most of the syntax-defining magic happens here (more on this later).
  • JRuby - This is the primary type interface between the developer and the framework.  Classes which wish to make use of Ruby integration must inherit from this trait.  This is where the Logger (for executed statements) is initialized and deactivated.  Within the corresponding object, all of the backend resources are managed.  This is where the actual ScriptEngineManager instance lives, as well as a set of utility methods to handle wrapping and unwrapping of framework-specific objects.
  • RubyWrapperObject - This implementation of RubyObject is designed to wrap around instances which already exist within the Ruby interpreter.  For example, if a Ruby method returns an instance of ActiveRecord::Base, it will be represented in Scala by a corresponding instance of RubyWrapperObject.  Note that objects which are equivalent in the Ruby interpreter are not guaranteed to be pointer-equivalent.  However, the equals(Object) method is well-defined within RubyObject, thus comparisons between RubyObject instances will return sane results.  The == method in Scala is defined in terms of equals(Object), so existing code will behave rationally.
  • RubyClass - With the exception of the JRuby trait, this is likely the only class within the framework which the developer will have to reference explicitly.  This class allows developer-defined Scala classes to wrap around existing Ruby classes, providing type-safe method calls and even extended functionality.  More on this feature later.
  • RubyMethod - We’ve already seen how this serves as a wrapper around calls to Ruby methods.  However, its default implementation assumes that the method is defined in the global namespace.  This is impractical for many method calls (such as dispatch on an object).
  • RubyInstanceMethod - This class solves the problem of object dispatch with RubyMethod.  All of the core functionality is identical to its superclass with the exception of the generated Ruby code.  Instead of just generating a method call passing parameters, this class will generate a method call on a given Ruby object.  Thus, this class depends upon RubyWrapperObject which maintains a reference to a corresponding Ruby instance.

Alternative Dispatch

Not every method call is made on the enclosing scope.  Sometimes it is necessary to call a method in an object to which you have a reference.  For example, a method may return an instance of a some Ruby class.  This instance will be automatically wrapped by a Scala instance of RubyWrappedObject.  Since this Scala class doesn’t actually define any methods which correspond to the Ruby class, it is necessary to once more utilize the “symbols as methods” trick in method dispatch.  There are two ways to call a method on an object like this: the send[R](String) method (where R is the return type), and the -> (arrow) operator.

Using the arrow operator is a lot like normal method calls, except with symbols instead of method names.  Just like dispatch on the enclosing scope, the call is converted into an instance of RubyMethod (actually, an instance of RubyInstanceMethod) which can then be used as a standard Scala method.  The difference between using arrow and dispatching on the enclosing scope is the syntax must be a little more contrived.

Parentheses have the second-highest priority of all the Scala operators (the dot operator (.) has the highest).  This means that if we simply “follow our nose” where the syntax is concerned, we will arrive at an order of invocation which leads to an undesirable result.  Consider the following sample:

val obj = 'create_person()
obj->'name()

The first call is a standard dispatch on the enclosing scope.  The second call is what is interesting to us.  Reading this line naturally (at least to old C/C++ programmers) we would arrive at the following sequence of events:

  1. Get a reference to the name method from the instance contained within obj
  2. Invoke the method, passing no parameters

Unfortunately, this is not how the compiler sees things.  Because parentheses bind tighter than the arrow operator, it actually resolves the expression in the following way:

  1. Get a reference to the name method contained within the enclosing scope
  2. Invoke the method, passing no parameters
  3. Invoke the -> method on the instance within obj, passing the result of name as a parameter

This is obviously not what we wanted.  Unfortunately, there’s no way to make the arrow operator bind tighter than parentheses.  This is a good thing from a language standpoint, but it causes problems for our syntax.

The solution is to enclose any “arrow dispatch” statement within parentheses so as to force the order of evaluation:

val obj = 'create_person()
(obj->'name)()

It looks a bit weird, but it’s the only way Scala will allow this to work.  This call now evaluates properly, calling the name method on the obj instance, passing no parameters.

There’s actually another problem associated with arrow dispatch in our DSL: Scala already has an implicit meaning for the arrow operator.  The following sample should look familiar to those of you who have worked with Scala in other applications:

val numbers = Map(1 -> "one", 2 -> "two", 3 -> "three")

By default, Scala defines the arrow operator as an alternative syntax for defining 2-tuples.  This is good for most things, but bad for us.  What we want is to define a new implicit type conversion which converts Any into a corresponding instance of RubyWrappedObject.  This would allow us to satisfy the syntax given above.  However, Scala’s 2-tuple syntax already defines an implicit type conversion for the Any type which deals with the arrow operator.  Rather than examining the context to attempt to disambiguate, the Scala compiler simply gives up and prints an error stating that the implicit type conversions are ambiguous.  This poses a bit of a problem and nearly killed the arrow operator idea in design.

The solution is actually to override Scala’s built-in conversion by defining our own conversion with the same name and signature but which provides us with the option of using our own arrow operator definition.  The behavior we want is to allow normal use of the arrow operator when dealing with Any -> Any, but convert to RubyWrappedObject and dispatch when dealing with Any -> Symbol.  After a little digging through the Scala standard library, I arrived at the following solution (defined in RubyObject):

implicit def any2ArrowAssoc[A](a:A) = new SpecialArrowAssoc(a)
 
private[scalaruby] class SpecialArrowAssoc[A](a:A) extends ArrowAssoc(a) {
  def ->(sym:Symbol) = (a match {
    case obj:RubyObject => obj
    case other => new RubyWrapperObject(other)
  })->sym
}

Notice that we extend Scala’s pre-existing ArrowAssoc[A] class (which handles the special 2-tuple syntax) and then overload the -> method to work differently with symbols.  This code now does precisely what we need.  By introducing this extra layer of indirection, as well as by overriding Scala’s existing conversion, we’re able to support the arrow syntax as shown in the above examples.

Sending Messages

There is one final form of dispatch which allows typed return values: send[R](String).  This is actually the method to which all the other dispatch forms delegate (as it is the most general).  This method is very similar to the Ruby send method which allows Smalltalk-style message passing on arbitrary objects.  The really important thing about this method though is that it will automatically cast the return value from the method to whatever type you specify, allowing you to define type-safe wrappers around existing Ruby methods in Scala:

def multiply(a:Int, b:Int) = send[Int]("multiply")(a, b)
 
val result:Int = multiply(123, 23)

send is effectively defined as a curried function since it takes a method name as a parameter and returns an instance of RubyMethod as a result.  This mimics the behavior of dispatch with symbol literals in that you can use send to generate type-safe partially-applied functions for corresponding Ruby methods.

Note that send could just as easily have taken a symbol as a parameter, rather than a string.  However, the metaphor throughout the DSL is “symbols as methods”, thus string was used to avoid logical conflict.  Scala itself was perfectly happy passing symbol literals around in addition to treating them as methods.

Class Wrapping

The final bit of code in the example now so far above us serves as a sample of how one might wrap an existing Ruby class within Scala.  Person is actually a class defined in Ruby (as you can see from the Ruby sources).  It has a read/write attribute, name, as well as an overridden to_s method.  RubyObject already contains the logic for handling calls to toString() and proxying them to Ruby’s to_s, but the name attribute must be handled explicitly in code.

The goal is basically to provide a type-safe wrapper around the Person Ruby class.  We could just as easily dispatch on the automatically wrapped instance of RubyWrappedObject using either syntax described above, but an explicit wrapper is a bit nicer.  The compiler can check things for us, and we can even add methods to the class (at least, as far as Scala is concerned) in true Ruby “open class” style.  All that is necessary to accomplish this wrapper is to extend RubyClass and to define the delegating wrapper methods:

class Person(name:String) extends RubyClass('Person, name) {
  def name = send[String]("name")()
  def name_=(n:String) = 'name = n
}

We specify which Ruby class we are wrapping as the first parameter in the constructor for RubyClass.  The parameters which follow are passed directly to the constructor of the corresponding Ruby class.  This Ruby constructor is invoked automatically, instantiating the corresponding wrapped Ruby object in the background.  Notice that we specify the name of the Ruby class using a symbol.  This is the one place in the framework that we break with the “symbols as methods” metaphor.  The consequence is a nice, clean syntax for Ruby class wrapping.  Unfortunately, it also means that wrapping a class within a non-included namespace (e.g. ActiveRecord::Base) can be a little clunky.  The only way to do it is to explicitly invoke the Symbol(String) constructor.  (this is required because Scala symbols can only contain alpha-numerics and underscores)

Once we have our wrapped class signature, it’s easy to define the delegate methods.  Scala encourages a blurring of field and method, similar to Ruby.  As such, it supports a very Ruby-esque syntax for accessor/mutator pairs.  This makes the wrapped syntax just a bit nicer.  For the accessor, we make a call to the send method, specifying the return type necessary for the wrapper.  The mutator allows us to be a bit more creative.

We don’t really need type-safe return values for a mutator.  We would normally just set the return type as Unit and ignore the result.  Thus we can once again use the symbol dispatch syntax.  Notice that this time we’re not directly treating a symbol as a method.  We’re apparently assigning a value to the symbol using the = operator (corresponds to the operator= assignment operator in C++).  This is possible through a separate implicit type conversion which generates a one-off utility instance:

private[scalaruby] class SpecialMethodAssign[R](sym:Symbol) {
  def intern_=(param:Any) = new RubyMethod[R](str2sym(sym2string(sym) + "="))(param)
}

As you can see, all this method does is generate a new symbol which includes the ‘=’ character and returns the result of dispatching on the corresponding Ruby method.  Note that mutators in Ruby are defined as “=“, thus appending “=” to the method name is the appropriate behavior.

Return Value Wrapping

There’s actually a slight problem involved in allowing Scala wrappers around existing Ruby types.  Well, not so much a problem as an inconsistency.  The problem is simply this: if a Ruby method creates an instance of a Ruby class for which there is a Scala wrapper and returns this value through the framework into Scala, one would expect this value would be wrapped into an instance of the Scala wrapper.  If you look in the example far above, there is an example of this in the create_person method.  The method creates an instance of Ruby class Person and returns it as a result.

Somehow, the framework must identify that there is a corresponding Scala wrapper and then properly create an instance.  This actually poses something of a dilemma in two ways.  Number one, Scala has no equivalent to Ruby’s ObjectSpace, so there’s no way to get a comprehensive list of all classes which have been defined.  Even if we could get this list, the corresponding Ruby class is specified in the constructor parameters to RubyClass, so there’s no way to obtain the information statically from outside the class.  Number two, we have to somehow create an instance of the Scala wrapper class without creating a corresponding instance of the wrapped Ruby class (since we already have one).  This means we need some sort of override in the RubyClass constructor.

The best solution to all of these problems is to introduce the associate method.  The usage is demonstrated at the top of the example where we associate the Person Ruby class with the Person Scala wrapper class.  More specifically, we associate the Ruby class with a pass-by-name parameter which defines how to instantiate the Scala class.  This is an important distinction as it solves our second problem of instance creation.  The framework has no way of knowing what parameters must be passed to the Scala wrapper constructor, so the instantiation itself must be passed:

associate('Person)(new Person(""))

As I mentioned previously, this is a pass-by-name parameter which means that it will not be immediately evaluated, but rather on-demand somewhere in the body of associate.  The associate method actually takes this value and wraps it in an anonymous method which invokes the instantiation each time a value of Ruby type Person must be wrapped.  Just prior to invoking the constructor, an override is put in place within the RubyClass singleton object (not shown in the class hierarchy) to prevent the creation of a corresponding Ruby instance.  This is what allows the new instance of Scala class Person to correspond with an existing Ruby value.  Here again we’re sacrificing concurrency for a hacky work-around to a complex problem.  Any sort of “proper” implementation would have to solve this problem in a more elegant way.

It Never Ends!

This post, that is.  There’s so much more I could ramble on about (I never even talked about how exceptions are handled), but this entry is already far too long.  Hopefully the material presented here only serves to whet your appetite for slicker JRuby-Scala integration and all the benefits it can bring.  I’ve packaged up the framework presented here as a downloadable archive.  The package includes the Ruby engine for the Java Scripting API as well as a jar-complete build made from the JRuby SVN.  The project may work with JRuby 1.0, but I doubt it.  Anyway, JRuby 1.1 is due shortly, so why bother.  Remember that this is extremely untested and very experimental.  (I did warn you about the concurrency issues, right?)  If this is interesting to people, I may do a proper release into an OSS project somewhere.  For right now, I just don’t have the time.  :-(

I hope this entry gives you an idea of what’s involved in Scala DSL implementation, as well as an idea of where such a technique may be useful in your own projects.  After all, what would be better than everyone being able to write their own Rails-killer and define highly fluid APIs!

XMLBuilder: A Ruby DSL Case Study

10
Mar
2008

XML is probably the most ubiquitous and most recognizable format in the modern development landscape.  It’s simple power in representing hierarchically structured data has made it the standard for representing everything from Word documents to databases.  It’s also one of the most verbose and meta-rich syntaxes known to man.

So in that sense, XML is a mixed blessing.  Its flexibility and intuitive nature allows developers to store just about any data in a human readable, easy-to-debug manner.  Unfortunately, its verboseness often makes generating the actual XML a very frustrating and boring foray into the land of boiler-plate.  Various techniques have been developed over the years to smooth this process (e.g. manipulating a DOM tree or reflectively marshalling objects directly to XML), but on the whole, generating XML in code is just as annoying as it has always been.  We’ve all written code like this in the past:

public String toXML() {
    final String INDENT = "    ";
    StringBuilder back = new StringBuilder(
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n\n");
 
    back.append("<person name=\"").append(getName()).append("\">\n");
    for (Book book : getBooks()) {
        back.append(INDENT).append("<book>").append(book.getTitle()).append("</book>\n");
    }
    back.append("</person>");
 
    return back.toString();
}

Not the most pleasant of algorithms.  Oh there’s nothing complex or challenging about the code, it’s just annoying (as String manipulation often is).  Things would get a little more interesting if we actually had some sort of recursive hierarchy to traverse, but even then it would still be pretty straight-forward.  XML generation is tedious grudge-work to which we all must submit from time to time.

Domain Specific Languages

There’s a new wave in programming (especially in the communities surrounding dynamic languages like Ruby and Groovy) called “Domain Specific Language”, or DSL for short.  The DSL technique has really been around for a long time, but it’s just now finding its way to the mainstream, “Citizen Joe” developers.  This is because for decades, domain specific languages have been separate languages unto themselves, with their own syntax, quirks, and utter lack of tool support.

For example, if a company wanted to specify their business logic in a more readable format using a DSL, the would have to spend months, sometimes years of effort to build an entirely new language, just for the one application.  While there was no need to generate a truly flexible and extensible syntax (or even one which was Turing-complete), such efforts would still require an enormous amount of work to be put into trivialities like parsers, AST walkers and core libraries.  Obviously this meant that DSLs were extremely uncommon.  There are very few use-cases for something which requires so much and gains you so little.

This variety of domain specific language is called an “External DSL”.  This name stems from the fact that the language is completely independent and external to other languages.  The innovation which made the DSL attainable for the common-man is that of the “Internal DSL”.

People often struggle to precisely define internal DSLs.  In the broadest sense, an internal DSL is a language API carefully structured to satisfy a particular syntax when used in code.  There is no syntax parsing involved in implementing an internal DSL.  Rather, effort is focused on the API itself.  The more flexible the syntax of the underlying language, the more powerful and potentially intuitive the syntax of the DSL can be.  It is for this reason that languages such as Ruby, Python, Groovy and company are often used to implement internal DSLs.  These languages are defined by extremely flexible and dynamic syntax, lending themselves perfectly to such efforts.

With this in mind, it should theoretically be possible to design an “API” for XML generation that’s simple and intuitive.  The DSL could be implemented using Ruby, though actually sufficiently dynamic language could do.  Implementing an internal DSL is a notoriously difficult task, so perhaps a step-by-step walk through is in order.

Getting Started

The very first step in creating an internal DSL is to design the syntax.  Similar to how test-driven development starts with a set of unit tests and builds functionality which satisfies the tests, DSL development starts with a few code samples and builds an API to satisfy the syntax.  This step will guide all of the code we write as we implement the DSL.

One of the primary goals for our DSL syntax should be to reflect (as much as possible) the structure and essence of the XML output in the generating code.  One of the major shortcomings of the ad hoc “string concat” XML generation technique is an utter lack of logical structure in the code.  Sure your algorithm may be nicely organized and formatted, but does it really reflect the actual structure of the XML it’s generating?  Probably not.  Another major goal should be brevity.  XML generating code is extremely verbose, and the whole idea behind writing a DSL for XML generation is to elevate some of this hassle.

So, we’ve got brevity and logical structure.  As minor goals, we also may want to spend some effort making the syntax versatile.  We don’t want the algorithm to perform incorrectly if we try to generate a document with the < character in some field.  With that said however, we don't need to be overly concerned with flexibility.  Yes, our DSL should be as capable as possible, but not to the detriment of brevity and clarity.  These concerns are paramount, functionality can take a back seat.

With these goals in mind, let’s try writing a static syntax for our Person/Book example above. Bear in mind that every construct used in the syntax must be valid Ruby which the interpreter will swallow.  How it will swallow is unimportant right now.  For this step, it’s helpful to remember that the ruby -c command will perform a syntax check on a source file without actually attempting to run any sort of interpretation.

xml.person :name => 'Daniel "Leroy" Spiewak' do
  book do
    'Godel, Escher, Bach'
  end
  book do
    'The Elegant Universe'
  end
  book do
    'Fabric of the Cosmos'
  end
  book do
    "The Hitchhiker's Guide to the Galaxy"
  end
  book    # book with no title
end
 
puts xml  # prints the generated XML

Note that this is all static, there’s no dynamically generated data to muddle the picture.  We can worry about that later.

This code sample gives us a rough idea of what the syntax should look like.  According to ruby -c, it’s valid syntax, so we’re ready to proceed.  Everything looks fairly unencumbered by meta-syntax, and the logical structure is certainly represented.  Now we do a quick mental pass through the syntax to ensure that all of our functional bases are covered.  In the example we’ve used attributes, nested elements and text data.  One of our attributes is making use of illegal XML characters (double quotes).  We haven’t tried mixing elements and text yet, but it’s fairly obvious how this could be done.

Strictly speaking, we haven’t rigorously defined anything.  What we have done is given ourselves a framework to build off of.  This sample will serve as an invaluable template to guide our coding.

Parsing the Syntax

Remember I said that internal DSLs do not involve any syntax parsing?  Well, in a way I was wrong.  The next thing we need to do is walk through the syntax ourselves and understand how Ruby will understand it.  This step requires some fairly in-depth knowledge of Ruby’s (or whatever language you’re using) syntax.  Hopefully the following diagram will help to clarify the process.

image1

I’ve annotated the areas of interest in cute, candy colors to try and illustrate roughly what the Ruby syntax parser will see when it looks at this code.  It’s important to understand this information since it is the key to translating our contrived syntax into a rigorous API.

For the sake of this discussion, I’m going to assume you have a working knowledge of Ruby and have already recognized some of the more basic syntax elements within the sample.  For example, the do/end block is just that, a Block object (roughly the Ruby equivalent of a closure).  The annotated String literal is an implicit return value for the inner block (the last statement of a block is implicitly the return value).  This syntax employs a Ruby “trick” which allows us to drop the return statement (an important trick, since you can’t return from a block anyway).

Anyone familiar with Ruby on Rails should be aware of Ruby’s convenient Hash literal syntax employed for the element attributes.  As annotated, this literal is being passed as a method parameter (Ruby allows us to drop the parentheses).  Here again we’re using little oddities in the Ruby syntax to clean up our DSL and reduce unnecessary cruft.

You should also be familiar with the blurred line between variable and method so common in Ruby.  As annotated, the xml syntax element could be either a local field, or a method that we’re invoking without the parentheses.  To clear up this point, we need to refer to the full sample above.  In no place do we declare, nor do we provide for the declaration of a local variable xml.  Thus, xml() will be a method within our API available in the global scope.

Now we get to the really interesting stuff: those mysterious undefined methods.  As with other dynamic languages, Ruby allows method invocations upon non-existent methods.  To someone from a static language background, this seems rather peculiar, but it’s perfectly legal I assure you.  In a sense, it’s much like the “methods as messages” metaphor employed by Objective-C and Smalltalk.  When you call the method, you’re just broadcasting a message to the object, hoping someone answers the call.

In Ruby, when a method is called which has no corresponding handler, a special method called method_missing is invoked.  This method is what allows us to handle messages even without a pre-defined method body.  ActiveRecord makes extensive use of this feature.  Every time you access a database field, you’re calling a non-existent method and thus, method_missing.

So based on the first non-existent method we’re calling (person(…)), we can already say something about our API.  Somewhere in the global scope, we’re going to need to have a method called xml().  This method will return an instance of a class which defines method_missing and so can handle our person(…) invocation.  It seems this implementation of method_missing will also need to optionally take a Block as the final parameter, allowing it to pass execution on to its logical children.  When referring back to our original sample, the final line seems to indicate that the instance must implement the to_s() method, allowing puts() to implicitly convert it to a String.

Implementation

All that work and we still haven’t even written a solid line of code.  What we have done though is given ourselves a clear idea of where we need to go, and a rough outline for how it can be done.  We’ve laid the foundation for the implementation by giving ourselves two critical starting points: xml() and that mysterious outer-scoped method_missing.  We may not know how we’re going to implement all this yet, but we at least have an idea on where to begin.

Starting with the easy one, let’s implement a basic framework around xml().

require 'singleton'
 
module XML
  class XMLBuilder
    attr_reader :last
 
    private
    def initialize
    end
 
    def method_missing(sym, *args, &block)
      # ...
    end
 
    def to_s
      @last
    end
 
    def self.instance
      @@instance ||= new
    end
  end
end
 
def xml
  XML::XMLBuilder.instance
end

Notice how we’re using a singleton instance of XMLBuilder?  That’s because there’s no need for us to have more than one instance exposed to the DSL users.  XMLBuilder is just a placeholder class that dispatches the first level of commands for the DSL and will assemble the result for us as it executes.  Thus, XMLBuilder can be considered the root of our DSL, the corner-stone upon which the entire API is bootstrapped.  We do however need to allow for other, private instances as we’ll see later on.

Another item worthy of note in this snippet is the non-standard method_missing signature.  This is because we will actually need the block as a proper Proc object down the line.  A block parameter (prefixed with &) is the only parameter which can follow a varargs parameter (prefixed with *) and there can only be one of them.

We can now try a first-pass implementation of method_missing.  This implementation is really just a sample with a very significant shortcoming.  The actual implementation is quite a bit more complex.

def method_missing(sym, *args, &block)
  @last = "<#{sym.to_s}"
 
  if args.size > 0 and args[0].is_a? Hash  # never hurts to be sure
    args[0].each do |key, value|
      @last += " #{key.to_s}=\"#{value.to_s}\""
    end
  end
 
  if block.nil?
    @last += "/>"   # if there's no children, just close the tag
  else
    @last += ">"
 
    builder = XMLBuilder.new
    builder.instance_eval block
 
    @last += builder.last
 
    @last += "</#{sym.to_s}>"
  end
end

Again, this is just the rough idea.  In our actual implementation we need to be concerned about things like valid attribute values (as demonstrated in our sample), proper element names, etc.

The key to the whole algorithm is the instance_eval invocation.  This single statement passes control to the next block down and starts the process all over again.  The important thing about this is evaluating the block within the context of the an XMLBuilder instance, rather than just its enclosing context.  This allows the nested block to take advantage of the same method_missing implementaiton, hence implicitly recursing further into the XML tree.  This technique is extremely powerful and absolutely critical to a lot of DSL implementations.

You’ll also notice that this is breaking the singleton pattern we established in our class design.  This is because a separate instance of XMLBuilder is required to handle each nested block within the DSL tree.  It’s very important to remember that we’re not actually exposing this instance in the public API, it’s just a tool we use within the implementation.  The API users will still only see the singleton XMLBuilder instance.

So a bit more semantically, what we’re doing is the following:

  1. Handle the root non-existant method invocation
  2. Deal with any attributes in the single Hash parameter (if exists)
  3. Check for nested block. If found, create a new builder instance and use its context to evaluate the child block
  4. Within child block evaluation, recurse to step 1 for each method invocation
  5. Accumulate result from child block evaluation and return

Of course, this doesn’t deal with nested text content within elements.  However, the principles of the implementation are fairly clear and the rest is just code.

As with all internal DSLs, the user-friendly DSL syntax is supported by an API that’s ugly, hacky and heavily dependant on language quirks (such as the super-critical instance_eval).  In fact, it has been said that internal DSLs encourage the use of language quirks if it simplifies the API from the end-developer standpoint.  Of course this makes the end-developer code very clean and easy to maintain at the cost of the DSL developer’s code, which is nightmarish and horrible to work with.  It’s a tradeoff that must be considered when the decision is made to go with a DSL-style API.

Conclusion

Hopefully this was a worthwhile trek into the gory innards of implementing an internal DSL.  I leave you with one final code sample to whet your appetite for the fully-implemented API.  This is an extract from the ActiveObjects Ant build file converted to use our new DSL.  It’s interesting that this converted version is significantly cleaner than the original, XML form.

require 'xmlbuilder'
 
xml.project :name => 'ActiveObjects', :default => :build do
  dirname :property => 'activeobjects.dir', :file => '${ant.file.ActiveObjects}'
  property :file => '${activeobjects.dir}/build.properties'
 
  target :name => :init do
    mkdir :dir => '${activeobjects.dir}/bin'
  end
 
  target :name => :build, :depends => :init do
    javac :srcdir => '${activeobjects.dir}/src', :source => 1.5, :debug => true
  end
 
  target :name => :build_test, :depends => [:init, :build] do
    property :name => 'javadoc.intern.path', :value => '${activeobjects.dir}/${javadoc.path}'
  end
end
 
puts xml

This will print the following XML:

<?xml version="1.0" encoding="UTF-8"?>
 
<project name="ActiveObjects" default="build">
  <dirname property="activeobjects.dir" file="${ant.file.ActiveObjects}"/>
  <property file="${activeobjects.dir}/build.properties"></property>
  <target name="init">
    <mkdir dir="${activeobjects.dir}/bin" />
  </target>
  <target name="build" depends="init">
    <javac source="1.5" debug="yes" srcdir="${activeobjects.dir}/src"/>
  </target>
  <target name="build_test" depends="init,build">
    <property name="javadoc.intern.path" value="${activeobjects.dir}/${javadoc.path}"></property>
  </target>
</project>

The fully implemented DSL is available in a single Ruby file. Also linked are some examples to provide a more balanced view of the capabilities.