Skip to content

Bencode Stream Parsing in Java

15
Jul
2008

It’s surprising how universal XML has become.  It doesn’t seem to matter what the problem, XML is the solution.  For example, consider a simple client/server architecture where the communication protocol must transmit some sort of structured data.  Nine developers out of ten will form the basis of the protocol around XML.  If it’s a lot of data to be transferred, then they will compress the XML using Java’s stream compression libraries.  If there’s binary data to be transmitted, it will either be stored as CDATA within the XML or as files within the same compressed archive.  Very few developers will actually stop and consider alternative solutions.

One such “alternative solution” is bencode (pronounced “bee-encode”).  Similar to formats like XML and JSON, bencode defines a series of constructs which may be used to encode arbitrarily complex data.  However, unlike XML, the design focus of the format was not to produce verbose, human-readable documents, but rather to encode data in the most concise manner possible.  To that end, the core bencode specification only includes four data types, two simple and two composite structures.  These types are defined with an almost complete absence of meta, requiring very little “structure” to clutter the data stream.

Unfortunately, outside of applications like BitTorrent, this elegant binary format has seen remarkably little adoption.  Because of this state of affairs, it can be extremely difficult to find libraries to actually process bencode data.  Not too long ago, I ran into a production use-case which required both parsing and generation of bencode-formatted files.  I considered digging into the source code for Vuze (nee “Azureus”), but a) it seemed like a lot of boring, nearly-wasted effort, and b) I strongly suspect that their bencode parser and generator are extremely space inefficient, since the data sources which they deal with are remarkably small.

The second hang-up was really a more significant motivator than the first, due to the fact that I knew I would be dealing with bencode streams potentially gigabytes in size.  So, rather than fruitlessly dig through someone else’s code, I decided to put all of this formal parser theory to work and roll my own library.  Unless you’re already familiar with bencode, I suggest you read the Wikipedia article to get a feel for the format, otherwise some of what I will be talking about will make no sense at all.  :-)

The first thing I needed to do was build the generation half of the library.  I decided that it would be easier if I avoided trying to use the same backend framework classes with both the generator and the parser.  For example, there are actually two classes in the framework which contain the logic for handling an integer: IntegerValue and IntegerType.  The former is for use in the parser, while the latter is for use in the generator.  This separation of logic may seem a little strange, but it actually simplifies things tremendously.

Remember my primary requirement: extremely efficient implementation of both generator and parser, especially with respect to space.  If I attempted to use the same classes to represent data for both the parser and the generator, then the parser would be forced to read the entire stream into some sort of in-memory representation (think about it; it’s actually true).  Obviously, this is unacceptable for streams that are gigabytes in size, so the traditional “good design” from an object-oriented standpoint was out.

Stream Generation

Since I needed the functionality of bencode stream generation before I needed parsing, I started with that aspect of the framework.  Here again, the most obvious “object-oriented” approach would have been the wrong one.  When we think of generating output in a structured format programmatically, we naturally imagine a DOM-like tree representation (preferably framework-agnostic) which is then walked by the framework to produce the output.  The major disadvantage to this approach is that it requires paging everything into memory.  This works for smaller applications or situations where the data is already in memory, but for my particular use-case, it would have been disastrous.

The only way to avoid paging everything into memory for stream generation is to structure the API so that the data is “pulled” by the generator, rather than “pushed” to it in tree-form.  In other words, the data itself has to be lazy-loaded, using callbacks to grab the data as-needed and hold it in memory only as long as is absolutely necessary.  In a functional language, this would be done with closures (or even normal data types in a pure-functional language).  However, as we all know, Java does not support such time-saving features.  The only recourse is to use abstract classes and interfaces which can be overridden in anonymous inner-classes as well as top-level classes as necessary.

image

After a bit of experimentation, the finalized hierarchy looks something like this.  Logically, every type must be able to query its abstract method for data of a certain Java type (long for IntegerType, InputStream for StringType, etc), convert this data into bencode with the appropriate meta, and then write the result to a given OutputStream.  Also following our nose, we see the semantic differences between composite and primitive types are really quite limited, especially if we simplify everything to a black box “get data / write encoding” methodology.  In fact, the only thing that CompositeType actually does is enforce the prefix/suffix encoding of every composite type.  Since this is in compliance with the bencode specification, we are safe in extracting this functionality into a superclass.

The more interesting distinction is between so-called “variant” and “invariant” types.  This is where you should begin to notice that I have over-engineered this library to some degree.  If I was just trying to create a pure bencode generator, then I could have skipped InvariantPrimitiveType and VariantPrimitiveType and just let IntegerType and StringType extend PrimitiveType directly.  This comes back to my initial requirements.

Priority one was to create a framework which was blazingly fast, but priority two was to ensure that it was extensible at the type level.  For the particular application I was interested in, I required more than just the core bencode types.  Also on the agenda were proper UTF-8 strings, dates, and support for null.  To accommodate all of this without too much code duplication, I knew I would have to extract a lot of the functionality into generic superclasses.  Hence my somewhat incorrect use of the terms “variant” and “invariant” to describe the difference between the integer type - which is prefix/suffix delimited - and the string type - which defines a length as its prefix and has no closing suffix.

Anyway, back to the problem at hand.  In addition to the CompositeType and PrimitiveType, you should also notice EntryType.  This “extra” type exists to handle the fact that bencode dictionaries are extremely weird and sit rather outside the “common functionality” umbrella of the format in general.  For one thing, the specification requires that dictionary entries be sorted by key, obviously implying some sort of Comparable relation.  Moreover, these keys must be themselves strings, but StringType isn’t comparable because its writeValue(OutputStream) method doesn’t return the data in question, but merely writes it to a given OutputStream.  Are we starting to see the problems with space-efficient implementations?

Enough babble though, let’s see some code!  Here’s how we might encode some very simple data using the generator framework:

public class GeneratorTest {
    public static void main(String[] args) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        final byte[] picture = new byte[0];        // presumably something interesting
 
        DictionaryType root = new DictionaryType() {
            @Override
            protected void populate(SortedSet<EntryType<?>> entries) {
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("name"), 
                        new LiteralStringType("Arthur Dent")));
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("number"), 
                        new IntegerType(42)));
 
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("picture"), 
                        new StringType() {
 
                    @Override
                    protected long getLength() {
                        return picture.length;
                    }
 
                    @Override
                    protected void writeValue(OutputStream os) throws IOException {
                        os.write(picture);
                    }
                }));
 
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("planets"), 
                        new ListType() {
 
                    @Override
                    protected void populate(ListTypeStream list) throws IOException {
                        list.add(new LiteralStringType("Earth"));
                        list.add(new LiteralStringType("Somewhere else"));
                        list.add(new LiteralStringType("Old Earth"));
                    }
                }));
            }
        };
 
        try {
            root.write(os);
        } catch (IOException e) {
            e.printStackTrace();
        }
 
        System.out.println(new String(os.toByteArray()));
    }
 
    private static class LiteralStringType extends StringType 
            implements Comparable<LiteralStringType> {
        private final String value;
 
        public LiteralStringType(String value) {
            this.value = value;
        }
 
        @Override
        protected long getLength() {
            return value.length();
        }
 
        @Override
        protected void writeValue(OutputStream os) throws IOException {
            os.write(value.getBytes("US-ASCII"));
        }
 
        public int compareTo(LiteralStringType o) {
            return o.value.compareTo(value);
        }
    }
}

It’s hard to imagine why some people claim that Java is a verbose language…

The API may seem a little clumsy, but most of that is caused by the conniptions required to make the generator lazily pull the data, rather than paging it all into memory ahead of time.  Throwing that aside, the rest of the verbosity seems to come from the need for LiteralStringType, rather than just having a StringType which could handle this for us.  The reason for this extra headache is shown in the population of the “picture” field, which presumably may contain several megabytes worth of data from some external source such as a file or database (in this case of course, it doesn’t contain anything, but that’s besides the point).

The result of the above is as follows:

d4:name11:Arthur Dent6:numberi42e7:picture0:7:planetsl5:Earth14:Somewhere else9:Old Earthee

Or, with a little formatting to make it more palatable:

d
  4:name
  11:Arthur Dent

  6:number
  i42e

  7:picture
  0:

  7:planets
  l
    5:Earth
    14:Somewhere else
    9:Old Earth
  e
e

Technically, this is no longer valid bencode, but it is much easier to read this way.

The Parser

With all this bustle surrounding the generator, it’s easy to forget about the inverse process: parsing.  As it turns out, this is both easier and far less elegant than the solution for the generator (I know, it’s a sad state of affairs when the above is considered “elegant”).  Here again, there was a need for the parser to be extremely efficient, especially in terms of memory.  Thus, the logical approach of simply parsing the stream into an in-memory tree doesn’t really work.  Instead, the parser must be a so-called “pull parser”, which only parses each token upon request.  The parser only does exactly what work you ask of it, nothing more.

My initial designs for the parser attempted to follow the example set by the generator: each value type self-contained, responsible for parsing its own format.  As it turns out, this can be difficult to accomplish.  I could have expanded slightly on the parser combinator concept, but monads are very clumsy to achieve in Java, which led me to rule out that option.  In the end, I took a middle ground.

Click for full size

As before, a common superinterface sits above the entire representative hierarchy.  To understand this hierarchy a little better, perhaps it would be helpful to look at the full source for Value:

public interface Value<T> {
    public T resolve() throws IOException;
    public boolean isResolved();
}

The resolve() method is really the core of the entire parser.  The concept is that each value will be able to consume the bytes necessary to determine its own value, which is converted and returned.  This is extremely convenient because it enables VariantValue(s) (such as string) to carry the logic for parsing to a specific length, rather than the conventional e terminator.  In order to avoid clogging up memory, the return value of resolve() should not be memoized (though, there is nothing in the framework to prevent it).  Conventionally, values which are already resolved should throw an exception if they are resolved a second time.  This prevents the framework from holding onto values which are no longer needed.

You will also notice that CompositeValue not only inherits from Value, but also from the JDK interface, Iterable.  Logically, a composite value is a linear collection of values, consumed one at a time.  To me, that sounds a lot like a unidirectional iterator.  We can, of course, resolve the entire composite at once, mindlessly consuming all of its values, but since all of the values are lost once consumed, the only purpose for such an action would be if we know that we don’t care about a particular composite and we just want to rapidly skip to the next value in the stream.

Returning to primitive values, the resolve() method for IntegerValue is worthy of note, not so much for its uniqueness, but because it is very similar to the parsing technique used in all the other values:

public Long resolve() throws IOException {
    if (resolved) {
        throw new IOException("Value already resolved");
    }
    resolved = true;
 
    boolean negative = false;
    long value = 0;
 
    int b = 0;
    while ((b = is.read()) >= 0) {
        int digit = b - '0';
 
        if (digit < 0 || digit > 9) {
            if (b == '-') {
                negative = true;
            } else if (b == 'e') {
                break;
            } else {
                throw new IOException("Unexpected character in integer value: " 
                    + Character.forDigit(b, 10));
            }
        } else {
            value = (value * 10) + digit;
        }
    }
 
    if (negative) {
        value *= -1;
    }
 
    return value;
}

The i prefix itself is consumed before control flow even enters this method.  This is because the prefix is required to determine the appropriate value implementation to use.  Specifically, the logic to perform this determination is contained within the Parser class, which maintains a map of Value(s) and their associated prefixes.  String values have special logic associated with them, as they do not have a prefix.

As with most hand-coded parsers, this one operates on the principle of “eat until it hurts”.  We start out by assuming that the integer value extends to the end of the stream, then we set about to find a premature end to the integer, at which point we break out and call it a day.  Since we are moving from left to right through a base-10 integer, we must multiply the current accumulator by 10 prior to adding the new digit. 

Actually, the real heart of the parser framework is CompositeValue.  This class is inherited by Parser to define a special value encompassing the stream itself (which is viewed as a composite value with no delimiters and only a single child).  This unification allows us to keep the code for parsing a composite stream in a single location.  This implementation is a little less concise than the code for parsing an integer, but it follows the same pattern and is fairly instructive:

protected final Value<?> parse() throws IOException {
    if (resolved) {
        throw new IOException("Composite value already resolved");
    }
 
    if (previous != null) {
        if (!previous.isResolved()) {
            previous.resolve();        // ensure we're at the right spot in the stream
        }
    }
 
    byte b = -1;
    if (readAhead instanceof Some) {
        b = readAhead.value();
        readAhead = new None<Byte>();
    } else {
        b = read();
    }
 
    if (b >= 0) {
        Class<? extends Value<?>> valueType = parser.getValueType(b);
 
        if (valueType != null) {
            return previous = Parser.createValue(valueType, parser, is);
        } else if (b > '0' && b <= '9') {
            return previous = readString(b - '0');
        } else if (b == ' ' || b == '\n' || b == '\r' || b == '\t') {
            return parse();        // loop state
        } else {
            throw new IOException("Unexpected character in the parse stream: " 
                + Character.forDigit(b, 10));
        }
    }
 
    throw new IOException("Unexpected end of stream in composite value");
}
 
private final StringValue readString(long length) throws IOException {
    int i = is.read();
 
    if (i >= 0) {
        byte b = (byte) i;
 
        if (b == ':') {
            return Parser.createValue(StringValue.class, parser, 
                new SubStream(is, length));
        } else if (b >= '0' && b <= '9') {
            return readString((length * 10) + b - '0');
        } else {
            throw new IOException("Unexpected character in string value: " 
                + Character.forDigit(i, 10));
        }
    }
 
    throw new IOException("Unexpected end of stream in string value");
}

It seems a bit imposing, but really this code is more of the same logic we saw previously when dealing with integers.  The only value type which really gives us trouble here is string.  We can’t simply treat it like the others because it has no prefix.  For this reason, we must assume that any unbound integer is an inclusive prefix for a string.  In most parser implementations, this would require backtracking, but because we are doing this by hand, we can condense the backtrack into an inherited parameter (borrowing terminology from attribute grammars), avoiding the performance hit.

There’s one final bit of weirdness which deserves attention before we bail on this small epic: dictionary values.  Intuitively, a dictionary value should be parsed into a Java Map, or some sort of associative data structure.  Unfortunately, a map is by definition a random access data structure.  Since we are dealing with a sequential bencode stream, the only recourse to satisfy this property would be to page the entire dictionary into memory.  This of course violates one of the primary requirements which is to avoid using more memory than necessary.

The solution I eventually chose to this problem was to limit dictionary access to sequential, which translates into alphabetical given the nature of bencode dictionaries.  Thus, a dictionary can be parsed in the same way as a list, where each element is a sequential key and value, jointly represented by EntryValue.  To make usage patterns slightly easier, EntryValue memoizes the key and value.  Due to the fact that both of these objects are themselves Value(s), this does not lead to inadvertent memory bloat.

Conclusion

Hopefully the parser and generator presented here will be of some utility in situations where you have to parse large volumes of bencoded data.  The API is (admittedly) bizarre and difficult to deal with, but the performance results are difficult to deny.  This framework is currently deployed in production, where benchmarks have shown that it imposes little-to-no runtime overhead, and practically zero memory overhead (despite the sizeable amounts of data being processed).

For convenience, I actually created a Google Code project for this framework so as to facilitate its development internally to the project I was working on.  The end result of this is unlike most of my experiments, there is actually a proper SVN from which the source may be obtained!  A packaged JAR may be obtained from the downloads section.

JRuby Interop DSL in Scala

24
Mar
2008

JRuby is an amazing bit of programming.  It has managed to rise from its humble beginnings as a hobby project on SourceForge to the most viable third-party Ruby implementation currently available.  As far as I am aware, JRuby is the only Ruby implementation other than MRI which is capable of running an unmodified Rails application.  But JRuby’s innovation is not just limited to a rock-solid Ruby interpreter, it also provides tight integration between Java and Ruby.

There’s a lot of material out there on how to replace Java with Ruby “glue code” in your application.  The so-called “polyglot programming” technique states that we should embrace multiplicity of language in our applications.  Java may be very suitable for the core business logic of the application, but for actually driving the frontend UI, we may want to use something more expressive (like Ruby).  JRuby provides some powerful constructs which allow access to Java classes from within any Ruby application.  For example:

require 'java'
 
JFrame = javax.swing.JFrame
JButton = javax.swing.JButton
JLabel = javax.swing.JLabel
 
BorderLayout = java.awt.BorderLayout
 
class MainWindow < JFrame
  def initialize
    super 'My Test Window'
 
    setSize(300, 200)
    setDefaultCloseOperation EXIT_ON_CLOSE
 
    label = JLabel.new('You pushed the button', JLabel::CENTER)
      label.visible = false
    add label
 
    button = JButton.new 'Push Me'
    button.add_action_listener do
      label.visible = true
    end
    add(button, BorderLayout::SOUTH)
  end
end
 
window = MainWindow.new
window.visible = true

sshot-1  sshot-2 

Not a terribly complex example, but it illustrates some of the major advantages of JRuby.  Notice how clean and concise this code is.  It wouldn’t have been much longer had I done this using Java, but it would certainly have been less readable.  Ruby is absolutely perfect for this sort of use case (driving a UI).

As I said though, there are a myriad of examples showing this sort of thing.  As such, it’s not a very interesting topic for a posting.  What the masses have failed to cover, however, is how to accomplish the opposite: calling from Java into Ruby.

Likely the reason this topic has received less attention is because Java is the language will the veritable zoo of libraries and frameworks.  The amount of effort and research that has been put into Java simply dwarfs the comparative immaturity of the Ruby offerings.  Given the disparity, why would you even want to call into Ruby from Java?  This conclusion seems logical until one remembers that almost any application which uses Ruby for the frontend must actually pass flow control to Ruby at some point.  This means calling some sort of Ruby code.

The Java Way

There is some information available on the JRuby Wiki.  The wiki article really should include the caveat that “some experimentation may be required.”  Sufficient information is available, but it is neither intuitive nor convenient.  From Java, the syntax for executing an arbitrary Ruby statement looks like this:

ScriptEngineManager m = new ScriptEngineManager();
ScriptEngine rubyEngine = m.getEngineByName("jruby");
ScriptContext context = engine.getContext();
 
context.setAttribute("label", new Integer(4), ScriptContext.ENGINE_SCOPE);
 
try {
    rubyEngine.eval("puts 2 + $label", context);
} catch (ScriptException e) {
    e.printStackTrace();
}

It’s a typical Java API: over-bloated, over-designed and over-generic.  What would be really nice is to have a syntax for accessing Ruby objects that is as seamless as accessing Java from Ruby.  I want to be able to call Ruby methods and use Ruby classes with the same ease that I can use Java methods and classes.  In short, I want an internal DSL for Ruby.

Unfortunately, Java is a bit constrained in this regard.  Java’s syntax is extremely rigid and does not lend itself well to DSL construction.  It’s certainly possible, but the result is usually less than satisfactory.  We could certainly construct an API around the the Java Scripting API (JSR-233) which provides more high-level access (such as direct method calls and object wrappers), but it would be clunky and only a marginal improvement over the original.

The good news is there’s another language tightly integrated with Java that has a far more flexible syntax.  Rather than building our JSR-233 wrapper in Java, we can avail ourselves of Scala’s power and flexibility, hopefully arriving at a DSL which approaches native “feel” in its syntax.

The Scala Way

Since we’re attempting to construct a tightly-integrated API for language calls, the most effective route would be to apply techniques already discussed in the context of DSL design.  As always, we start with the syntax and allow it to drive the implementation:

// syntax.scala
import com.codecommit.scalaruby._
 
object Main extends Application with JRuby {
  require("test")
 
  associate('Person)(new Person(""))
 
  println("Received from multiply: " + 'multiply(123, 23))
  println("Functional test: " + funcTest('test_string))
 
  val me = new Person("Daniel Spiewak")
  println("Name1: " + me.name)
  println("Name2: " + (me->'name)())
 
  me.name = "Daniel"
  println("New Name: " + me.name)
 
  println("Person#toString(): " + me)
 
  val otherPerson = 'create_person().asInstanceOf[AnyRef]
  println("create_person type: " + otherPerson.getClass())
  println("create_person value: " + otherPerson.send[String]("name")())
 
  eval("puts 'Ruby integration is amazing'")
 
  def funcTest(fun:(Any*)=>String) = fun()
}
 
class Person(name:String) extends RubyClass('Person, name) {
  def name = send[String]("name")()
  def name_=(n:String) = 'name = n
}

And the associated Ruby code:

# test.rb
class Person
  attr_reader :name
  attr_writer :name
 
  def initialize(name)
    @name = name
  end
 
  def to_s
    "Person: {name=#{name}}"
  end
end
 
def test_string
  'Daniel Spiewak'
end
 
def multiply(a, b)
  a * b
end
 
def create_person
  Person.new 'Test Person'
end

Obviously we’re going to need some heavy implicit type conversions.  The important thing to note is that we don’t see any residue of the Java Scripting API, it’s all been encapsulated by our DSL.  We’ve taken an API which is oriented around single-call, low-level invocations and created a high-level wrapper framework which allows method calls, instantiation and even some form of type-checking.

Starting from the top, we see a call which should be familiar to Rubyists, the require statement.  In our framework, this method call is just a bit of syntactic sugar around a call to eval(String).  This semantics are basically the same as within Ruby directly, with the exception of how Ruby source files are resolved.  Any script file on the CLASSPATH is fair game, in addition to the normal Ruby locations.  This allows us to easily embed Ruby scripts within application JARs, libraries and other Java distributables.

Moving down a bit further, we find a somewhat mysterious call to the curried associate(Symbol)(RubyObject) method.  The purpose of this invocation will become more apparent later on.  Suffice it to say that this step is necessary to allow Scala class wrappers around existing Ruby classes.

On the next line of interest, we see for the first time how the framework allows for seamless Ruby method invocation.  Unlike Ruby, Scala doesn’t allow us to simply handle calls to non-existent methods.  Because of this limitation, we have to be a bit more clever in how we structure the syntax.  In this case, we use Scala symbols to represent the method.  There doesn’t seem to be a terribly good explanation of symbols in Scala, but there’s plenty of information regarding how they work in Ruby.  Since the concepts are virtually identical, techniques are cross-applicable.

The key to the whole “symbols as methods” idea is implicit type conversion.  The JRuby trait inherits a set of conversions which look something like this:

implicit def sym2Method[R](sym:Symbol):(Any*)=>R = send[R](sym2string(sym))
implicit def sym2MethodAssign[R](sym:Symbol) = new SpecialMethodAssign[R](sym)
 
private[scalaruby] class SpecialMethodAssign[R](sym:Symbol) {
  def intern_=(param:Any) = new RubyMethod[R](str2sym(sym2string(sym) + "="))(param)
}

Though we haven’t looked at it yet, it is possible to infer the purpose of the send(String) method.  It’s function is to prepare a call to a Ruby method without actually invoking it.  This distinction allows us to pass Ruby methods around as method parameters, just like standard Scala methods.  The method returned is actually an instance of class RubyMethod[R] (where R is the return type).  Scala allows classes to extend structural types like methods, allowing us to redefine the method invocation semantics for wrapped Ruby calls.

class RubyMethod[R](method:Symbol) extends ((Any*)=>R) {
  import JRuby.engine
 
  override def apply(params:Any*) = call(params.toArray)
 
  private[scalaruby] def call(params:Array[Any]):R = {
    val context = engine.getContext()
    val plist = new Array[String](params.length)
 
    for (i <- 0 until params.length) {
      plist(i) = "res" + i
      context.setAttribute(plist(i), JRuby.resolveValue(params(i)), ScriptContext.ENGINE_SCOPE)
      plist(i) = "$" + plist(i)
    }
 
    evaluate(() => if (plist.length > 0) {
      sym2string(method) + "(" + plist.reduceLeft[String](_ + ", " + _) + ")"
    } else {
      sym2string(method) + "()"
    })
  }
 
  protected def evaluate(invoke:()=>String):R = {
    val toRun = invoke()
    Logger.getLogger("com.codecommit.scalaruby").info(toRun)
 
    JRuby.handleExcept(JRuby.wrapValue[R](engine, engine.eval(toRun, engine.getContext())))
  }
}

The gist of this code is simply to assign every parameter value to an attribute in the Ruby runtime.  Attributes of ENGINE_SCOPE (as defined by JSR-233) are represented as global variables within Ruby.  These variables are named sequentially starting from zero.  (e.g. $res0, $res1, …)  As you can imagine, this technique tends to be a bit of a concurrency killer.  To keep things simple, I decided to completely ignore the issues associated with asynchronous execution.  It is certainly possible to adapt the framework to function in a multi-threaded environment, but I didn’t bother to do it.  (one of the perks of blogging is a license to extreme laziness)

Once these parameters are assigned, the method call is evaluated within the context of the runtime.  This is done by literally generating the corresponding Ruby code (done in the anonymous method) and then wrapping the return value in an instance of RubyObject (if necessary).  Note that the send(String) method does not actually kick-start this invocation process at all.  Rather, it creates an instance of RubyMethod[R] which corresponds to the method name.  This class extends (Any*)=>R, so it may be used in the normal “method fashion” - by appending parentheses which enclose parameters (if any).

Supporting Cast

At this point, it’s worth taking a moment to examine the specifics of the framework class hierarchy.  A number of classes exist to wrap around Ruby objects and methods.  We’ve already seen a few of them (RubyMethod[R] and RubyObject), but it’s worth going into more detail as to their purpose and relation to one another. 

Note that these class names often conflict with existing classes in the JRuby implementation.  This odd coincidence is precipitated by the fact that the framework seems to deal with a lot of the same concepts as the JRuby runtime (go figure).  Rather than obfuscating my class naming to avoid conflict, I just assume that you will either make use of the enhanced Scala import feature (as I have in the implementation), or just avoid using the JRuby internal classes.

image

  • RubyObject - The root of the object hierarchy.  This abstract class is designed to encapsulate the core functionality of the generic object (roughly: send, -> and eval) as well as containing all of the implicit type conversions.  Most of the syntax-defining magic happens here (more on this later).
  • JRuby - This is the primary type interface between the developer and the framework.  Classes which wish to make use of Ruby integration must inherit from this trait.  This is where the Logger (for executed statements) is initialized and deactivated.  Within the corresponding object, all of the backend resources are managed.  This is where the actual ScriptEngineManager instance lives, as well as a set of utility methods to handle wrapping and unwrapping of framework-specific objects.
  • RubyWrapperObject - This implementation of RubyObject is designed to wrap around instances which already exist within the Ruby interpreter.  For example, if a Ruby method returns an instance of ActiveRecord::Base, it will be represented in Scala by a corresponding instance of RubyWrapperObject.  Note that objects which are equivalent in the Ruby interpreter are not guaranteed to be pointer-equivalent.  However, the equals(Object) method is well-defined within RubyObject, thus comparisons between RubyObject instances will return sane results.  The == method in Scala is defined in terms of equals(Object), so existing code will behave rationally.
  • RubyClass - With the exception of the JRuby trait, this is likely the only class within the framework which the developer will have to reference explicitly.  This class allows developer-defined Scala classes to wrap around existing Ruby classes, providing type-safe method calls and even extended functionality.  More on this feature later.
  • RubyMethod - We’ve already seen how this serves as a wrapper around calls to Ruby methods.  However, its default implementation assumes that the method is defined in the global namespace.  This is impractical for many method calls (such as dispatch on an object).
  • RubyInstanceMethod - This class solves the problem of object dispatch with RubyMethod.  All of the core functionality is identical to its superclass with the exception of the generated Ruby code.  Instead of just generating a method call passing parameters, this class will generate a method call on a given Ruby object.  Thus, this class depends upon RubyWrapperObject which maintains a reference to a corresponding Ruby instance.

Alternative Dispatch

Not every method call is made on the enclosing scope.  Sometimes it is necessary to call a method in an object to which you have a reference.  For example, a method may return an instance of a some Ruby class.  This instance will be automatically wrapped by a Scala instance of RubyWrappedObject.  Since this Scala class doesn’t actually define any methods which correspond to the Ruby class, it is necessary to once more utilize the “symbols as methods” trick in method dispatch.  There are two ways to call a method on an object like this: the send[R](String) method (where R is the return type), and the -> (arrow) operator.

Using the arrow operator is a lot like normal method calls, except with symbols instead of method names.  Just like dispatch on the enclosing scope, the call is converted into an instance of RubyMethod (actually, an instance of RubyInstanceMethod) which can then be used as a standard Scala method.  The difference between using arrow and dispatching on the enclosing scope is the syntax must be a little more contrived.

Parentheses have the second-highest priority of all the Scala operators (the dot operator (.) has the highest).  This means that if we simply “follow our nose” where the syntax is concerned, we will arrive at an order of invocation which leads to an undesirable result.  Consider the following sample:

val obj = 'create_person()
obj->'name()

The first call is a standard dispatch on the enclosing scope.  The second call is what is interesting to us.  Reading this line naturally (at least to old C/C++ programmers) we would arrive at the following sequence of events:

  1. Get a reference to the name method from the instance contained within obj
  2. Invoke the method, passing no parameters

Unfortunately, this is not how the compiler sees things.  Because parentheses bind tighter than the arrow operator, it actually resolves the expression in the following way:

  1. Get a reference to the name method contained within the enclosing scope
  2. Invoke the method, passing no parameters
  3. Invoke the -> method on the instance within obj, passing the result of name as a parameter

This is obviously not what we wanted.  Unfortunately, there’s no way to make the arrow operator bind tighter than parentheses.  This is a good thing from a language standpoint, but it causes problems for our syntax.

The solution is to enclose any “arrow dispatch” statement within parentheses so as to force the order of evaluation:

val obj = 'create_person()
(obj->'name)()

It looks a bit weird, but it’s the only way Scala will allow this to work.  This call now evaluates properly, calling the name method on the obj instance, passing no parameters.

There’s actually another problem associated with arrow dispatch in our DSL: Scala already has an implicit meaning for the arrow operator.  The following sample should look familiar to those of you who have worked with Scala in other applications:

val numbers = Map(1 -> "one", 2 -> "two", 3 -> "three")

By default, Scala defines the arrow operator as an alternative syntax for defining 2-tuples.  This is good for most things, but bad for us.  What we want is to define a new implicit type conversion which converts Any into a corresponding instance of RubyWrappedObject.  This would allow us to satisfy the syntax given above.  However, Scala’s 2-tuple syntax already defines an implicit type conversion for the Any type which deals with the arrow operator.  Rather than examining the context to attempt to disambiguate, the Scala compiler simply gives up and prints an error stating that the implicit type conversions are ambiguous.  This poses a bit of a problem and nearly killed the arrow operator idea in design.

The solution is actually to override Scala’s built-in conversion by defining our own conversion with the same name and signature but which provides us with the option of using our own arrow operator definition.  The behavior we want is to allow normal use of the arrow operator when dealing with Any -> Any, but convert to RubyWrappedObject and dispatch when dealing with Any -> Symbol.  After a little digging through the Scala standard library, I arrived at the following solution (defined in RubyObject):

implicit def any2ArrowAssoc[A](a:A) = new SpecialArrowAssoc(a)
 
private[scalaruby] class SpecialArrowAssoc[A](a:A) extends ArrowAssoc(a) {
  def ->(sym:Symbol) = (a match {
    case obj:RubyObject => obj
    case other => new RubyWrapperObject(other)
  })->sym
}

Notice that we extend Scala’s pre-existing ArrowAssoc[A] class (which handles the special 2-tuple syntax) and then overload the -> method to work differently with symbols.  This code now does precisely what we need.  By introducing this extra layer of indirection, as well as by overriding Scala’s existing conversion, we’re able to support the arrow syntax as shown in the above examples.

Sending Messages

There is one final form of dispatch which allows typed return values: send[R](String).  This is actually the method to which all the other dispatch forms delegate (as it is the most general).  This method is very similar to the Ruby send method which allows Smalltalk-style message passing on arbitrary objects.  The really important thing about this method though is that it will automatically cast the return value from the method to whatever type you specify, allowing you to define type-safe wrappers around existing Ruby methods in Scala:

def multiply(a:Int, b:Int) = send[Int]("multiply")(a, b)
 
val result:Int = multiply(123, 23)

send is effectively defined as a curried function since it takes a method name as a parameter and returns an instance of RubyMethod as a result.  This mimics the behavior of dispatch with symbol literals in that you can use send to generate type-safe partially-applied functions for corresponding Ruby methods.

Note that send could just as easily have taken a symbol as a parameter, rather than a string.  However, the metaphor throughout the DSL is “symbols as methods”, thus string was used to avoid logical conflict.  Scala itself was perfectly happy passing symbol literals around in addition to treating them as methods.

Class Wrapping

The final bit of code in the example now so far above us serves as a sample of how one might wrap an existing Ruby class within Scala.  Person is actually a class defined in Ruby (as you can see from the Ruby sources).  It has a read/write attribute, name, as well as an overridden to_s method.  RubyObject already contains the logic for handling calls to toString() and proxying them to Ruby’s to_s, but the name attribute must be handled explicitly in code.

The goal is basically to provide a type-safe wrapper around the Person Ruby class.  We could just as easily dispatch on the automatically wrapped instance of RubyWrappedObject using either syntax described above, but an explicit wrapper is a bit nicer.  The compiler can check things for us, and we can even add methods to the class (at least, as far as Scala is concerned) in true Ruby “open class” style.  All that is necessary to accomplish this wrapper is to extend RubyClass and to define the delegating wrapper methods:

class Person(name:String) extends RubyClass('Person, name) {
  def name = send[String]("name")()
  def name_=(n:String) = 'name = n
}

We specify which Ruby class we are wrapping as the first parameter in the constructor for RubyClass.  The parameters which follow are passed directly to the constructor of the corresponding Ruby class.  This Ruby constructor is invoked automatically, instantiating the corresponding wrapped Ruby object in the background.  Notice that we specify the name of the Ruby class using a symbol.  This is the one place in the framework that we break with the “symbols as methods” metaphor.  The consequence is a nice, clean syntax for Ruby class wrapping.  Unfortunately, it also means that wrapping a class within a non-included namespace (e.g. ActiveRecord::Base) can be a little clunky.  The only way to do it is to explicitly invoke the Symbol(String) constructor.  (this is required because Scala symbols can only contain alpha-numerics and underscores)

Once we have our wrapped class signature, it’s easy to define the delegate methods.  Scala encourages a blurring of field and method, similar to Ruby.  As such, it supports a very Ruby-esque syntax for accessor/mutator pairs.  This makes the wrapped syntax just a bit nicer.  For the accessor, we make a call to the send method, specifying the return type necessary for the wrapper.  The mutator allows us to be a bit more creative.

We don’t really need type-safe return values for a mutator.  We would normally just set the return type as Unit and ignore the result.  Thus we can once again use the symbol dispatch syntax.  Notice that this time we’re not directly treating a symbol as a method.  We’re apparently assigning a value to the symbol using the = operator (corresponds to the operator= assignment operator in C++).  This is possible through a separate implicit type conversion which generates a one-off utility instance:

private[scalaruby] class SpecialMethodAssign[R](sym:Symbol) {
  def intern_=(param:Any) = new RubyMethod[R](str2sym(sym2string(sym) + "="))(param)
}

As you can see, all this method does is generate a new symbol which includes the ‘=’ character and returns the result of dispatching on the corresponding Ruby method.  Note that mutators in Ruby are defined as “=“, thus appending “=” to the method name is the appropriate behavior.

Return Value Wrapping

There’s actually a slight problem involved in allowing Scala wrappers around existing Ruby types.  Well, not so much a problem as an inconsistency.  The problem is simply this: if a Ruby method creates an instance of a Ruby class for which there is a Scala wrapper and returns this value through the framework into Scala, one would expect this value would be wrapped into an instance of the Scala wrapper.  If you look in the example far above, there is an example of this in the create_person method.  The method creates an instance of Ruby class Person and returns it as a result.

Somehow, the framework must identify that there is a corresponding Scala wrapper and then properly create an instance.  This actually poses something of a dilemma in two ways.  Number one, Scala has no equivalent to Ruby’s ObjectSpace, so there’s no way to get a comprehensive list of all classes which have been defined.  Even if we could get this list, the corresponding Ruby class is specified in the constructor parameters to RubyClass, so there’s no way to obtain the information statically from outside the class.  Number two, we have to somehow create an instance of the Scala wrapper class without creating a corresponding instance of the wrapped Ruby class (since we already have one).  This means we need some sort of override in the RubyClass constructor.

The best solution to all of these problems is to introduce the associate method.  The usage is demonstrated at the top of the example where we associate the Person Ruby class with the Person Scala wrapper class.  More specifically, we associate the Ruby class with a pass-by-name parameter which defines how to instantiate the Scala class.  This is an important distinction as it solves our second problem of instance creation.  The framework has no way of knowing what parameters must be passed to the Scala wrapper constructor, so the instantiation itself must be passed:

associate('Person)(new Person(""))

As I mentioned previously, this is a pass-by-name parameter which means that it will not be immediately evaluated, but rather on-demand somewhere in the body of associate.  The associate method actually takes this value and wraps it in an anonymous method which invokes the instantiation each time a value of Ruby type Person must be wrapped.  Just prior to invoking the constructor, an override is put in place within the RubyClass singleton object (not shown in the class hierarchy) to prevent the creation of a corresponding Ruby instance.  This is what allows the new instance of Scala class Person to correspond with an existing Ruby value.  Here again we’re sacrificing concurrency for a hacky work-around to a complex problem.  Any sort of “proper” implementation would have to solve this problem in a more elegant way.

It Never Ends!

This post, that is.  There’s so much more I could ramble on about (I never even talked about how exceptions are handled), but this entry is already far too long.  Hopefully the material presented here only serves to whet your appetite for slicker JRuby-Scala integration and all the benefits it can bring.  I’ve packaged up the framework presented here as a downloadable archive.  The package includes the Ruby engine for the Java Scripting API as well as a jar-complete build made from the JRuby SVN.  The project may work with JRuby 1.0, but I doubt it.  Anyway, JRuby 1.1 is due shortly, so why bother.  Remember that this is extremely untested and very experimental.  (I did warn you about the concurrency issues, right?)  If this is interesting to people, I may do a proper release into an OSS project somewhere.  For right now, I just don’t have the time.  :-(

I hope this entry gives you an idea of what’s involved in Scala DSL implementation, as well as an idea of where such a technique may be useful in your own projects.  After all, what would be better than everyone being able to write their own Rails-killer and define highly fluid APIs!

XMLBuilder: A Ruby DSL Case Study

10
Mar
2008

XML is probably the most ubiquitous and most recognizable format in the modern development landscape.  It’s simple power in representing hierarchically structured data has made it the standard for representing everything from Word documents to databases.  It’s also one of the most verbose and meta-rich syntaxes known to man.

So in that sense, XML is a mixed blessing.  Its flexibility and intuitive nature allows developers to store just about any data in a human readable, easy-to-debug manner.  Unfortunately, its verboseness often makes generating the actual XML a very frustrating and boring foray into the land of boiler-plate.  Various techniques have been developed over the years to smooth this process (e.g. manipulating a DOM tree or reflectively marshalling objects directly to XML), but on the whole, generating XML in code is just as annoying as it has always been.  We’ve all written code like this in the past:

public String toXML() {
    final String INDENT = "    ";
    StringBuilder back = new StringBuilder(
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n\n");
 
    back.append("<person name=\"").append(getName()).append("\">\n");
    for (Book book : getBooks()) {
        back.append(INDENT).append("<book>").append(book.getTitle()).append("</book>\n");
    }
    back.append("</person>");
 
    return back.toString();
}

Not the most pleasant of algorithms.  Oh there’s nothing complex or challenging about the code, it’s just annoying (as String manipulation often is).  Things would get a little more interesting if we actually had some sort of recursive hierarchy to traverse, but even then it would still be pretty straight-forward.  XML generation is tedious grudge-work to which we all must submit from time to time.

Domain Specific Languages

There’s a new wave in programming (especially in the communities surrounding dynamic languages like Ruby and Groovy) called “Domain Specific Language”, or DSL for short.  The DSL technique has really been around for a long time, but it’s just now finding its way to the mainstream, “Citizen Joe” developers.  This is because for decades, domain specific languages have been separate languages unto themselves, with their own syntax, quirks, and utter lack of tool support.

For example, if a company wanted to specify their business logic in a more readable format using a DSL, the would have to spend months, sometimes years of effort to build an entirely new language, just for the one application.  While there was no need to generate a truly flexible and extensible syntax (or even one which was Turing-complete), such efforts would still require an enormous amount of work to be put into trivialities like parsers, AST walkers and core libraries.  Obviously this meant that DSLs were extremely uncommon.  There are very few use-cases for something which requires so much and gains you so little.

This variety of domain specific language is called an “External DSL”.  This name stems from the fact that the language is completely independent and external to other languages.  The innovation which made the DSL attainable for the common-man is that of the “Internal DSL”.

People often struggle to precisely define internal DSLs.  In the broadest sense, an internal DSL is a language API carefully structured to satisfy a particular syntax when used in code.  There is no syntax parsing involved in implementing an internal DSL.  Rather, effort is focused on the API itself.  The more flexible the syntax of the underlying language, the more powerful and potentially intuitive the syntax of the DSL can be.  It is for this reason that languages such as Ruby, Python, Groovy and company are often used to implement internal DSLs.  These languages are defined by extremely flexible and dynamic syntax, lending themselves perfectly to such efforts.

With this in mind, it should theoretically be possible to design an “API” for XML generation that’s simple and intuitive.  The DSL could be implemented using Ruby, though actually sufficiently dynamic language could do.  Implementing an internal DSL is a notoriously difficult task, so perhaps a step-by-step walk through is in order.

Getting Started

The very first step in creating an internal DSL is to design the syntax.  Similar to how test-driven development starts with a set of unit tests and builds functionality which satisfies the tests, DSL development starts with a few code samples and builds an API to satisfy the syntax.  This step will guide all of the code we write as we implement the DSL.

One of the primary goals for our DSL syntax should be to reflect (as much as possible) the structure and essence of the XML output in the generating code.  One of the major shortcomings of the ad hoc “string concat” XML generation technique is an utter lack of logical structure in the code.  Sure your algorithm may be nicely organized and formatted, but does it really reflect the actual structure of the XML it’s generating?  Probably not.  Another major goal should be brevity.  XML generating code is extremely verbose, and the whole idea behind writing a DSL for XML generation is to elevate some of this hassle.

So, we’ve got brevity and logical structure.  As minor goals, we also may want to spend some effort making the syntax versatile.  We don’t want the algorithm to perform incorrectly if we try to generate a document with the < character in some field.  With that said however, we don't need to be overly concerned with flexibility.  Yes, our DSL should be as capable as possible, but not to the detriment of brevity and clarity.  These concerns are paramount, functionality can take a back seat.

With these goals in mind, let’s try writing a static syntax for our Person/Book example above. Bear in mind that every construct used in the syntax must be valid Ruby which the interpreter will swallow.  How it will swallow is unimportant right now.  For this step, it’s helpful to remember that the ruby -c command will perform a syntax check on a source file without actually attempting to run any sort of interpretation.

xml.person :name => 'Daniel "Leroy" Spiewak' do
  book do
    'Godel, Escher, Bach'
  end
  book do
    'The Elegant Universe'
  end
  book do
    'Fabric of the Cosmos'
  end
  book do
    "The Hitchhiker's Guide to the Galaxy"
  end
  book    # book with no title
end
 
puts xml  # prints the generated XML

Note that this is all static, there’s no dynamically generated data to muddle the picture.  We can worry about that later.

This code sample gives us a rough idea of what the syntax should look like.  According to ruby -c, it’s valid syntax, so we’re ready to proceed.  Everything looks fairly unencumbered by meta-syntax, and the logical structure is certainly represented.  Now we do a quick mental pass through the syntax to ensure that all of our functional bases are covered.  In the example we’ve used attributes, nested elements and text data.  One of our attributes is making use of illegal XML characters (double quotes).  We haven’t tried mixing elements and text yet, but it’s fairly obvious how this could be done.

Strictly speaking, we haven’t rigorously defined anything.  What we have done is given ourselves a framework to build off of.  This sample will serve as an invaluable template to guide our coding.

Parsing the Syntax

Remember I said that internal DSLs do not involve any syntax parsing?  Well, in a way I was wrong.  The next thing we need to do is walk through the syntax ourselves and understand how Ruby will understand it.  This step requires some fairly in-depth knowledge of Ruby’s (or whatever language you’re using) syntax.  Hopefully the following diagram will help to clarify the process.

image1

I’ve annotated the areas of interest in cute, candy colors to try and illustrate roughly what the Ruby syntax parser will see when it looks at this code.  It’s important to understand this information since it is the key to translating our contrived syntax into a rigorous API.

For the sake of this discussion, I’m going to assume you have a working knowledge of Ruby and have already recognized some of the more basic syntax elements within the sample.  For example, the do/end block is just that, a Block object (roughly the Ruby equivalent of a closure).  The annotated String literal is an implicit return value for the inner block (the last statement of a block is implicitly the return value).  This syntax employs a Ruby “trick” which allows us to drop the return statement (an important trick, since you can’t return from a block anyway).

Anyone familiar with Ruby on Rails should be aware of Ruby’s convenient Hash literal syntax employed for the element attributes.  As annotated, this literal is being passed as a method parameter (Ruby allows us to drop the parentheses).  Here again we’re using little oddities in the Ruby syntax to clean up our DSL and reduce unnecessary cruft.

You should also be familiar with the blurred line between variable and method so common in Ruby.  As annotated, the xml syntax element could be either a local field, or a method that we’re invoking without the parentheses.  To clear up this point, we need to refer to the full sample above.  In no place do we declare, nor do we provide for the declaration of a local variable xml.  Thus, xml() will be a method within our API