Skip to content

Scala Collections for the Easily Bored Part 1: A Tale of Two Flavors

21
Jul
2008

One of the most obvious things to a Java developer first coming into Scala-land is the radically different Collections API included as part of the standard library.  For the most part, we use the same frameworks and APIs in Scala as are available in Java.  This is natural, thanks to the extremely tight integration between the two languages.  So why is this one area such a startling departure from Scala’s heritage?

The answer has to do with what the Scala language is syntactically capable of handling.  Scala isn’t just an object-oriented language, it is also highly functional.  It is only natural that such an integral part of the core libraries would reflect this fact.  Unfortunately, most developers fail to take full advantage of the power offered by the collections API.  Despite the available power, most code written using Scala’s collections tends to look a lot like Java in disguise.

I had actually planned on addressing this topic in a single article.  However, Scala’s collections are so vast and powerful (sounds like one of Roddenberry’s alien consciousness beings) that it really would overrun the limits of conventional blogging to attempt to cover it all in a single post.  Despite the fact that I’ve been creating mammoth anthologies of late, I think it’s probably better to break it into bite-sized chunks.  First up: the confusing dual nature of Scala’s collections API!

A Tale of Two Flavors

The very first thing developers notice when looking at Scala’s collections is a (seemingly) odd redundancy in the specification.  Looking under the scala.collection package, we see not one, but three separate sub-packages, each containing what seem to be reimplementations of the same types.  For example, consider the following three traits:

I don’t know about you, but this confused the heck out of me the first time I really dug into Scala’s standard API.  Actually, it gets even worse when you discover that there is also a trait (and companion object), scala.collection.Map, which is actually a super-trait of the three listed above.  Seems like Dr. Odersky discovered the magic of separated namespaces and reacted like a two-year-old on espresso.

As it turns out, there’s a very logical reason for having these separated and seemingly-redundant collections packages.  Of the three, one of them can be discounted immediately as uninteresting.  The jcl package contains collections, but they are merely wrappers for the corresponding Java collections.  This allows more efficient transmutation between the Java collections API and Scala.  It is almost never necessary to use this package directly, since a number of implicit conversions are built into Scala to make the process essentially transparent.

Of course, this still leaves the immutable and mutable packages.  This distinction traces back to some of Scala’s functional roots.  As you are likely aware, Scala supports both mutable and immutable variables, as denoted by the var and val keywords, respectively.  While there are some significant differences at compile-time, conceptually, the only distinction between these two types is the former allows reassignment whereas the latter does not.  Mutable variables are a common - and indeed, essential - feature in imperative languages (Java, C++, Ruby, etc).  For example, here’s how we would sum an array of integers in Java:

int[] numbers = ...
 
int sum = 0;
for (int n : numbers) {
    sum += n;   // reassign sum to accumulate n
}

In this case, sum is a mutable variable which accumulates the total value of all numbers in the array, numbers.  Theoretical disputes aside, this style of programming is simply impossible in certain languages.  For example, SML provides no mechanism for declaring mutable variables.  So if we want to sum the values in a list of ints, we have to do it in some other way (code provided for the curious):

fun sumList ls = foldl (op +) 0 ls

In Scala, both of these techniques are available to us.  However, despite providing for mutable state, Scala does encourage developers to avoid it.  The reasoning behind this is that mutable state has a tendency to make code more difficult to reason about, making testing much harder.  Also, as any experienced developer will tell you, mutable state kills concurrency.

Not only does Scala encourage the use of immutable variables, it also encourages the use of immutable collections.  This concept may seem a little bizarre to those of you coming from an imperative background (I know it did to me), but it actually works.  While a map container which cannot be modified probably seems a little useless, it actually provides a startling degree of freedom from concerns like concurrent locking and unintended side-effects.  With immutable collections, you are able to manipulate objects with perfect assuredness that no method call will “accidentally” alter its contents.  How many times have you cloned a collection prior to returning it?  Or how often have you dug through someone else’s code just to assure yourself that it is safe to call a given method passing an instance of List?  Immutable data structures completely solve that problem.

Naturally, immutable collections require very different patterns and idioms than those which are mutable.  My ML code from above illustrates this to a very small degree, using a fold to traverse an immutable list.  As a general rule, immutable data structures are only useful when being passed around via method calls.  A common pattern is to build up a data structure recursively, creating a new instance with one more element at each invocation:

def toSet[T](list: List[T]) = {
  def traverse(list: List[T])(set: Set[T]): Set[T] = list match {
    case hd :: tail => traverse(tail)(set + hd)   // create a new Set, adding hd
    case Nil => set
  }
 
  traverse(list)(Set[T]())
}
 
val names = List("Daniel", "Chris", "Joseph", "Renee")
val nameSet = toSet(names)

At each recursive invocation of traverse, a new Set is created, based on the contents of the old one, set, but with one additional member, hd.  At the same time, we are deconstructing an immutable List, list, selecting its first element and whatever remains at each step.  Whenever you work with immutable data structures, you will see a lot of code which looks like this.

Of course, the natural question which comes to mind is, what about performance?  If each invocation actually creates a brand new Set for every recursive call, doesn’t that require a lot of inefficient object copying and heap operations?  Well, as it turns out, this isn’t really the case.  Yes, a new instance must be created at each turn, which is is a comparatively expensive operation on the JVM, but almost nothing is being copied.  All of Scala’s immutable data structures have a property known as persistence, which means that you don’t copy the data out of the old container when creating a new, you just have the new reference the old and treat all of its contents as if they were its own.  A linked list is a good example of this, since each node of a list contains exactly one element, as well as a reference to another node.  If we think of each node as a representative of a list starting with itself and traversing to the end, then list suddenly becomes a fully persistent data structure (since the new list contains a sub-list in its entirety and additive operations require no data copying).  Rich Hickey, the creator of Clojure (a Lisp dialect running on the JVM), has an excellent presentation which explains some of the hows and whys behind this technique (as well as some other interesting topics).  Chapter 19 of Programming in Scala (Odersky, Spoon & Venners) also has a good example of a persistent immutable queue.

Conclusion

I happen to like immutable data, so most of this series uses the scala.collection.immutable package.  However, there are certainly situations where mutable data structures are the only way to go, either because of system requirements or for performance reasons.  Fortunately, Scala’s mutable collections have an almost identical interface to its immutable collections.  Thus, most of the in formation presented here is applicable to both branches of the library.

Now that we have laid a basic foundation regarding the fundamentals of Scala’s collections framework, we can move onto more interesting things.  The next installment will deal with fold and map, the bread and butter of every collection.

Bencode Stream Parsing in Java

15
Jul
2008

It’s surprising how universal XML has become.  It doesn’t seem to matter what the problem, XML is the solution.  For example, consider a simple client/server architecture where the communication protocol must transmit some sort of structured data.  Nine developers out of ten will form the basis of the protocol around XML.  If it’s a lot of data to be transferred, then they will compress the XML using Java’s stream compression libraries.  If there’s binary data to be transmitted, it will either be stored as CDATA within the XML or as files within the same compressed archive.  Very few developers will actually stop and consider alternative solutions.

One such “alternative solution” is bencode (pronounced “bee-encode”).  Similar to formats like XML and JSON, bencode defines a series of constructs which may be used to encode arbitrarily complex data.  However, unlike XML, the design focus of the format was not to produce verbose, human-readable documents, but rather to encode data in the most concise manner possible.  To that end, the core bencode specification only includes four data types, two simple and two composite structures.  These types are defined with an almost complete absence of meta, requiring very little “structure” to clutter the data stream.

Unfortunately, outside of applications like BitTorrent, this elegant binary format has seen remarkably little adoption.  Because of this state of affairs, it can be extremely difficult to find libraries to actually process bencode data.  Not too long ago, I ran into a production use-case which required both parsing and generation of bencode-formatted files.  I considered digging into the source code for Vuze (nee “Azureus”), but a) it seemed like a lot of boring, nearly-wasted effort, and b) I strongly suspect that their bencode parser and generator are extremely space inefficient, since the data sources which they deal with are remarkably small.

The second hang-up was really a more significant motivator than the first, due to the fact that I knew I would be dealing with bencode streams potentially gigabytes in size.  So, rather than fruitlessly dig through someone else’s code, I decided to put all of this formal parser theory to work and roll my own library.  Unless you’re already familiar with bencode, I suggest you read the Wikipedia article to get a feel for the format, otherwise some of what I will be talking about will make no sense at all.  :-)

The first thing I needed to do was build the generation half of the library.  I decided that it would be easier if I avoided trying to use the same backend framework classes with both the generator and the parser.  For example, there are actually two classes in the framework which contain the logic for handling an integer: IntegerValue and IntegerType.  The former is for use in the parser, while the latter is for use in the generator.  This separation of logic may seem a little strange, but it actually simplifies things tremendously.

Remember my primary requirement: extremely efficient implementation of both generator and parser, especially with respect to space.  If I attempted to use the same classes to represent data for both the parser and the generator, then the parser would be forced to read the entire stream into some sort of in-memory representation (think about it; it’s actually true).  Obviously, this is unacceptable for streams that are gigabytes in size, so the traditional “good design” from an object-oriented standpoint was out.

Stream Generation

Since I needed the functionality of bencode stream generation before I needed parsing, I started with that aspect of the framework.  Here again, the most obvious “object-oriented” approach would have been the wrong one.  When we think of generating output in a structured format programmatically, we naturally imagine a DOM-like tree representation (preferably framework-agnostic) which is then walked by the framework to produce the output.  The major disadvantage to this approach is that it requires paging everything into memory.  This works for smaller applications or situations where the data is already in memory, but for my particular use-case, it would have been disastrous.

The only way to avoid paging everything into memory for stream generation is to structure the API so that the data is “pulled” by the generator, rather than “pushed” to it in tree-form.  In other words, the data itself has to be lazy-loaded, using callbacks to grab the data as-needed and hold it in memory only as long as is absolutely necessary.  In a functional language, this would be done with closures (or even normal data types in a pure-functional language).  However, as we all know, Java does not support such time-saving features.  The only recourse is to use abstract classes and interfaces which can be overridden in anonymous inner-classes as well as top-level classes as necessary.

image

After a bit of experimentation, the finalized hierarchy looks something like this.  Logically, every type must be able to query its abstract method for data of a certain Java type (long for IntegerType, InputStream for StringType, etc), convert this data into bencode with the appropriate meta, and then write the result to a given OutputStream.  Also following our nose, we see the semantic differences between composite and primitive types are really quite limited, especially if we simplify everything to a black box “get data / write encoding” methodology.  In fact, the only thing that CompositeType actually does is enforce the prefix/suffix encoding of every composite type.  Since this is in compliance with the bencode specification, we are safe in extracting this functionality into a superclass.

The more interesting distinction is between so-called “variant” and “invariant” types.  This is where you should begin to notice that I have over-engineered this library to some degree.  If I was just trying to create a pure bencode generator, then I could have skipped InvariantPrimitiveType and VariantPrimitiveType and just let IntegerType and StringType extend PrimitiveType directly.  This comes back to my initial requirements.

Priority one was to create a framework which was blazingly fast, but priority two was to ensure that it was extensible at the type level.  For the particular application I was interested in, I required more than just the core bencode types.  Also on the agenda were proper UTF-8 strings, dates, and support for null.  To accommodate all of this without too much code duplication, I knew I would have to extract a lot of the functionality into generic superclasses.  Hence my somewhat incorrect use of the terms “variant” and “invariant” to describe the difference between the integer type - which is prefix/suffix delimited - and the string type - which defines a length as its prefix and has no closing suffix.

Anyway, back to the problem at hand.  In addition to the CompositeType and PrimitiveType, you should also notice EntryType.  This “extra” type exists to handle the fact that bencode dictionaries are extremely weird and sit rather outside the “common functionality” umbrella of the format in general.  For one thing, the specification requires that dictionary entries be sorted by key, obviously implying some sort of Comparable relation.  Moreover, these keys must be themselves strings, but StringType isn’t comparable because its writeValue(OutputStream) method doesn’t return the data in question, but merely writes it to a given OutputStream.  Are we starting to see the problems with space-efficient implementations?

Enough babble though, let’s see some code!  Here’s how we might encode some very simple data using the generator framework:

public class GeneratorTest {
    public static void main(String[] args) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        final byte[] picture = new byte[0];        // presumably something interesting
 
        DictionaryType root = new DictionaryType() {
            @Override
            protected void populate(SortedSet<EntryType<?>> entries) {
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("name"), 
                        new LiteralStringType("Arthur Dent")));
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("number"), 
                        new IntegerType(42)));
 
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("picture"), 
                        new StringType() {
 
                    @Override
                    protected long getLength() {
                        return picture.length;
                    }
 
                    @Override
                    protected void writeValue(OutputStream os) throws IOException {
                        os.write(picture);
                    }
                }));
 
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("planets"), 
                        new ListType() {
 
                    @Override
                    protected void populate(ListTypeStream list) throws IOException {
                        list.add(new LiteralStringType("Earth"));
                        list.add(new LiteralStringType("Somewhere else"));
                        list.add(new LiteralStringType("Old Earth"));
                    }
                }));
            }
        };
 
        try {
            root.write(os);
        } catch (IOException e) {
            e.printStackTrace();
        }
 
        System.out.println(new String(os.toByteArray()));
    }
 
    private static class LiteralStringType extends StringType 
            implements Comparable<LiteralStringType> {
        private final String value;
 
        public LiteralStringType(String value) {
            this.value = value;
        }
 
        @Override
        protected long getLength() {
            return value.length();
        }
 
        @Override
        protected void writeValue(OutputStream os) throws IOException {
            os.write(value.getBytes("US-ASCII"));
        }
 
        public int compareTo(LiteralStringType o) {
            return o.value.compareTo(value);
        }
    }
}

It’s hard to imagine why some people claim that Java is a verbose language…

The API may seem a little clumsy, but most of that is caused by the conniptions required to make the generator lazily pull the data, rather than paging it all into memory ahead of time.  Throwing that aside, the rest of the verbosity seems to come from the need for LiteralStringType, rather than just having a StringType which could handle this for us.  The reason for this extra headache is shown in the population of the “picture” field, which presumably may contain several megabytes worth of data from some external source such as a file or database (in this case of course, it doesn’t contain anything, but that’s besides the point).

The result of the above is as follows:

d4:name11:Arthur Dent6:numberi42e7:picture0:7:planetsl5:Earth14:Somewhere else9:Old Earthee

Or, with a little formatting to make it more palatable:

d
  4:name
  11:Arthur Dent

  6:number
  i42e

  7:picture
  0:

  7:planets
  l
    5:Earth
    14:Somewhere else
    9:Old Earth
  e
e

Technically, this is no longer valid bencode, but it is much easier to read this way.

The Parser

With all this bustle surrounding the generator, it’s easy to forget about the inverse process: parsing.  As it turns out, this is both easier and far less elegant than the solution for the generator (I know, it’s a sad state of affairs when the above is considered “elegant”).  Here again, there was a need for the parser to be extremely efficient, especially in terms of memory.  Thus, the logical approach of simply parsing the stream into an in-memory tree doesn’t really work.  Instead, the parser must be a so-called “pull parser”, which only parses each token upon request.  The parser only does exactly what work you ask of it, nothing more.

My initial designs for the parser attempted to follow the example set by the generator: each value type self-contained, responsible for parsing its own format.  As it turns out, this can be difficult to accomplish.  I could have expanded slightly on the parser combinator concept, but monads are very clumsy to achieve in Java, which led me to rule out that option.  In the end, I took a middle ground.

Click for full size

As before, a common superinterface sits above the entire representative hierarchy.  To understand this hierarchy a little better, perhaps it would be helpful to look at the full source for Value:

public interface Value<T> {
    public T resolve() throws IOException;
    public boolean isResolved();
}

The resolve() method is really the core of the entire parser.  The concept is that each value will be able to consume the bytes necessary to determine its own value, which is converted and returned.  This is extremely convenient because it enables VariantValue(s) (such as string) to carry the logic for parsing to a specific length, rather than the conventional e terminator.  In order to avoid clogging up memory, the return value of resolve() should not be memoized (though, there is nothing in the framework to prevent it).  Conventionally, values which are already resolved should throw an exception if they are resolved a second time.  This prevents the framework from holding onto values which are no longer needed.

You will also notice that CompositeValue not only inherits from Value, but also from the JDK interface, Iterable.  Logically, a composite value is a linear collection of values, consumed one at a time.  To me, that sounds a lot like a unidirectional iterator.  We can, of course, resolve the entire composite at once, mindlessly consuming all of its values, but since all of the values are lost once consumed, the only purpose for such an action would be if we know that we don’t care about a particular composite and we just want to rapidly skip to the next value in the stream.

Returning to primitive values, the resolve() method for IntegerValue is worthy of note, not so much for its uniqueness, but because it is very similar to the parsing technique used in all the other values:

public Long resolve() throws IOException {
    if (resolved) {
        throw new IOException("Value already resolved");
    }
    resolved = true;
 
    boolean negative = false;
    long value = 0;
 
    int b = 0;
    while ((b = is.read()) >= 0) {
        int digit = b - '0';
 
        if (digit < 0 || digit > 9) {
            if (b == '-') {
                negative = true;
            } else if (b == 'e') {
                break;
            } else {
                throw new IOException("Unexpected character in integer value: " 
                    + Character.forDigit(b, 10));
            }
        } else {
            value = (value * 10) + digit;
        }
    }
 
    if (negative) {
        value *= -1;
    }
 
    return value;
}

The i prefix itself is consumed before control flow even enters this method.  This is because the prefix is required to determine the appropriate value implementation to use.  Specifically, the logic to perform this determination is contained within the Parser class, which maintains a map of Value(s) and their associated prefixes.  String values have special logic associated with them, as they do not have a prefix.

As with most hand-coded parsers, this one operates on the principle of “eat until it hurts”.  We start out by assuming that the integer value extends to the end of the stream, then we set about to find a premature end to the integer, at which point we break out and call it a day.  Since we are moving from left to right through a base-10 integer, we must multiply the current accumulator by 10 prior to adding the new digit. 

Actually, the real heart of the parser framework is CompositeValue.  This class is inherited by Parser to define a special value encompassing the stream itself (which is viewed as a composite value with no delimiters and only a single child).  This unification allows us to keep the code for parsing a composite stream in a single location.  This implementation is a little less concise than the code for parsing an integer, but it follows the same pattern and is fairly instructive:

protected final Value<?> parse() throws IOException {
    if (resolved) {
        throw new IOException("Composite value already resolved");
    }
 
    if (previous != null) {
        if (!previous.isResolved()) {
            previous.resolve();        // ensure we're at the right spot in the stream
        }
    }
 
    byte b = -1;
    if (readAhead instanceof Some) {
        b = readAhead.value();
        readAhead = new None<Byte>();
    } else {
        b = read();
    }
 
    if (b >= 0) {
        Class<? extends Value<?>> valueType = parser.getValueType(b);
 
        if (valueType != null) {
            return previous = Parser.createValue(valueType, parser, is);
        } else if (b > '0' && b <= '9') {
            return previous = readString(b - '0');
        } else if (b == ' ' || b == '\n' || b == '\r' || b == '\t') {
            return parse();        // loop state
        } else {
            throw new IOException("Unexpected character in the parse stream: " 
                + Character.forDigit(b, 10));
        }
    }
 
    throw new IOException("Unexpected end of stream in composite value");
}
 
private final StringValue readString(long length) throws IOException {
    int i = is.read();
 
    if (i >= 0) {
        byte b = (byte) i;
 
        if (b == ':') {
            return Parser.createValue(StringValue.class, parser, 
                new SubStream(is, length));
        } else if (b >= '0' && b <= '9') {
            return readString((length * 10) + b - '0');
        } else {
            throw new IOException("Unexpected character in string value: " 
                + Character.forDigit(i, 10));
        }
    }
 
    throw new IOException("Unexpected end of stream in string value");
}

It seems a bit imposing, but really this code is more of the same logic we saw previously when dealing with integers.  The only value type which really gives us trouble here is string.  We can’t simply treat it like the others because it has no prefix.  For this reason, we must assume that any unbound integer is an inclusive prefix for a string.  In most parser implementations, this would require backtracking, but because we are doing this by hand, we can condense the backtrack into an inherited parameter (borrowing terminology from attribute grammars), avoiding the performance hit.

There’s one final bit of weirdness which deserves attention before we bail on this small epic: dictionary values.  Intuitively, a dictionary value should be parsed into a Java Map, or some sort of associative data structure.  Unfortunately, a map is by definition a random access data structure.  Since we are dealing with a sequential bencode stream, the only recourse to satisfy this property would be to page the entire dictionary into memory.  This of course violates one of the primary requirements which is to avoid using more memory than necessary.

The solution I eventually chose to this problem was to limit dictionary access to sequential, which translates into alphabetical given the nature of bencode dictionaries.  Thus, a dictionary can be parsed in the same way as a list, where each element is a sequential key and value, jointly represented by EntryValue.  To make usage patterns slightly easier, EntryValue memoizes the key and value.  Due to the fact that both of these objects are themselves Value(s), this does not lead to inadvertent memory bloat.

Conclusion

Hopefully the parser and generator presented here will be of some utility in situations where you have to parse large volumes of bencoded data.  The API is (admittedly) bizarre and difficult to deal with, but the performance results are difficult to deny.  This framework is currently deployed in production, where benchmarks have shown that it imposes little-to-no runtime overhead, and practically zero memory overhead (despite the sizeable amounts of data being processed).

For convenience, I actually created a Google Code project for this framework so as to facilitate its development internally to the project I was working on.  The end result of this is unlike most of my experiments, there is actually a proper SVN from which the source may be obtained!  A packaged JAR may be obtained from the downloads section.

The Brilliance of BDD

9
Jun
2008

As I have previously written, I have recently been spending some time experimenting with various aspects of Scala, including some of the frameworks which have become available.  One of the frameworks I have had the privilege of using is the somewhat unassumingly-titled Specs, and implementation of the behavior-driven development methodology in Scala.

Specs takes full advantage of Scala’s flexible syntax, offering a very natural format for structuring tests.  For example, we could write a simple specification for a hypothetical add method in the following way:

object AddSpec extends Specification {
  "add method" should {
    "handle simple positives" in {
      add(1, 2) mustEqual 3
    }
 
    "handle simple negatives" in {
      add(-1, -2) mustEqual -3
    }
 
    "handle mixed signs" in {
      add(1, -2) mustEqual -1
      add(-1, 2) mustEqual 1
    }
  }
}

We could go on, of course, but you get the picture.  This code will lead to the execution of four separate assertions in three tests (to put things into JUnit terminology).  Fundamentally, this isn’t too much different than a standard series of unit tests, just with a slightly nicer syntax.

Specs defines a domain-specific language for structuring test assertions in a simple and intuitive way.  However, this is hardly the only framework for BDD.  Perhaps the most well-known such framework is RSpec, which answers a similar use-case in the Ruby programming language.  Our previous specification could be rewritten using RSpec as follows:

describe AddLib do
  it 'should handle simple positives' do
    add(1, 2).should == 3
  end
 
  it 'should handle simple negatives' do
    add(-1, -2).should == -3
  end
 
  it 'should handle mixed signs' do
    add(1, -2).should == -1
    add(-1, 2).should == 1
  end
end

The end-result is basically the same: the add method will be tested against the given assertions (all four of them) and the results printed in some sort of report form.  In this area, RSpec is significantly more mature than Specs, generating very slick HTML reports and nicely formatted console output.  This isn’t really a fundamental weakness of the Specs framework however, just indicative of the fact that RSpec has been around for a lot longer.

These two frameworks are interesting of course, but they are merely implementations of a much larger concept: behavior-driven development.  I’ve never been much of a fan of unit testing.  It’s always seemed to be incredibly dull and a very nearly fruitless waste of effort.  As much as I hate it though, I have to bow to the benefits of a self-contained test suite; and so I press on, cursing JUnit every step of the way.

BDD provides a nice alternative to unit testing.  At its core, it is not much different in that test groupings and primitive assertions are used to check all aspects of a test unit against predefined data.  However, there is something about the “flow” of a behavioral spec that is considerably easier to deal with.  For some reason, it is far less painful to devise a comprehensive test suite using BDD principles than conventional unit testing.  It seems a little far-fetched, but BDD actually makes it easier to write (and more importantly, formulate) exactly the same tests.

It’s an odd phenomenon, one which can only be caused by the storyboard flow of the code itself.  It is very natural to think of distinct requirements for a test unit when each of these requirements are being labeled and entered in a logical sequence.  Moreover, the syntax of both Specs and RSpec is such that there is very little boiler-plate required to setup an additional test.  Compare the previous BDD specs with the following JUnit4 example:

public class MathTest {
    @Test
    public void testSimplePositives() {
        assertEquals(3, add(1, 2));
    }
 
    @Test
    public void testSimpleNegatives() {
        assertEquals(-3, add(-1, -2);
    }
 
    @Test
    public void testMixedSigns() {
        assertEquals(-1, add(1, -2));
        assertEquals(1, add(-1, 2));
    }
}

JUnit just requires that much more syntax.  It breaks up the logical flow of the tests and (more importantly) the developer train of thought.  What can be worse is this syntax bloat makes it very tempting to just group all of the assertions into a single test - to save typing if nothing else.  This is problematic because one assertion may shadow all the others in the case of a failure, preventing them from ever being executed.  This can make certain problems much more difficult to isolate.

Logical flow is extremely important to test structure.  BDD frameworks provide a very nice syntax for painlessly defining comprehensive test suites.  The really wonderful thing about all of this is that BDD is available on the JVM, right now.  There’s nothing stopping you from writing your code in Java as you normally would, then creating your test suite in Scala using Specs rather than JUnit.  Alternatively, you could use RSpec on top of JRuby, or Gspec with Groovy.  All of these are seamless replacements for a test framework like JUnit, and requiring of far less syntactic overhead.

The growing move toward polyglot programming encourages the use of a separate language when it is best suited to a particular task.  In this case, several languages are available which offer far more powerful test frameworks than those which can be found in Java.  Why not take advantage of them?

The Plague of Polyglotism

28
Apr
2008

For those of you who don’t know, polyglotism is not some weird religion but actually a growing trend in the programming industry.  In essence, it is the concept that one should not be confined to a single language for a given system or even a specific application.  With polyglot programming, a single project could use dozens of different languages, each for a different task to which they are uniquely well-suited.

As a basic example, we could write a Wicket component which makes use of Ruby’s RedCloth library for working with Textile.  Because of Scala’s flexible syntax, we can use it to perform the interop between Wicket and Ruby using an internal DSL:

class TextileLabel(id:String, model:IModel) extends WebComponent(id, model) with JRuby {
  require("textile_utils")
 
  override def onComponentTagBody(stream:MarkupStream, openTag:ComponentTag) {
    replaceComponentTagBody(markupStream, openTag, 
        'textilize(model.getObject().toString()))
  }
}
# textile_utils.rb
require 'redcloth'
 
def textilize(text)
  doc = RedCloth.new text
  doc.to_html
end

Warning: Untested code

We’re actually using three languages here, even though we only have source for two of them.  The Wicket library itself is written in Java, our component is written in Scala and we work with the RedCloth library in Ruby.    This is hardly the best example of polyglotism, but it suffices for a simple illustration.  The general idea is that you would apply this concept to a more serious project and perform more significant tasks in each of the various languages.

The Bad News

This is all well and good, but there’s a glaring problem with this philosophy of design: not everyone knows every language.  You may be a language aficionado, picking up everything from Scheme to Objective-C, but it’s only a very small percentage of developers who share that passion.  Many projects are composed of developers without extensive knowledge of diverse languages.  In fact, even with a really good sampling of talent, it’s doubtful you’ll have more than one or two people fluent in more than two languages.  And unfortunately, there’s this pesky concern we all love called “maintainability”.

Let’s pretend that Slava Pestov comes into your project as a consultant and decides that he’s going to apply the polyglot programming philosophy.  He writes a good portion of your application in Java, Lisp and some language called Factor, pockets his consultant’s fee and then moves on.  Now the code he wrote may have been phenomenally well-designed and really perfect for the task, but you’re going to have a very hard time finding a developer who can maintain it.  Let’s say that six months down the road, you decide that your widget really needs a red push button, rather than a green radio selector.  Either you need a developer who knows Factor (hint: there aren’t very many), or you need a developer who’s willing to learn it.  The thing is that most developers with the knowledge and motivation to learn a language have either already done so, or are familiar enough with the base concepts as to be capable of jumping right in.  These developers fall into that limited group of people fluent in many different languages, and as such are a rare find.

Now I’m not picking on Factor in any way, it’s a very interesting language, but it still isn’t very widespread in terms of developer expertise.  That’s really what this all comes down to: developer expertise.  Every time you make a language choice, you limit the pool of developers who are even capable of groking your code.  If I decide to build an application in Java, even assuming that’s the only language I use, I have still eliminated maybe 20% of all developers from ever touching the project.  If I make the decision to use Ruby for some parts of the application, while still using Java for the others, I’ve now factored that 80% down to maybe 35% (developers who know Java and Ruby).  Once I throw in Scala, that cuts it down still further (maybe at 15% now).  If I add a fourth language - for example, Haskell - I’ve now narrowed the field so far, that it’s doubtful I’ll find anyone capable of handling all aspects within a reasonable price range.  It’s the same problem as with framework choice, except that frameworks are much easier to learn than languages.

The polyglot ideal was really devised by a bunch of nerdy folks like me.  I love languages and would like nothing better than to get paid to learn half a dozen new ones (assuming I’m coming into a project with a strange combination I haven’t seen before).  However, as I understand the industry, that’s not a common sentiment.  So a very loud minority of developers (/me waves) has managed to forge a very hot methodology, one which excludes almost all of the hard-working developer community.  If I didn’t know better, I would be tempted to say that it was a self-serving industry ploy to foster exclusivity in the job market.

I want to work on multi-language projects as much as anyone, but I really don’t think it’s the best thing right now.  I’m working on a project now which has an aspect for which Scala would be absolutely perfect, but since I’m the only developer on hand who is remotely familiar with the language, I’m probably going to end up recommending against its adoption.  Consider carefully the ramifications of trying new languages on your own projects, you may not be doing future developers any favors by going down that path.

Defining Scala Design Idioms

14
Apr
2008

With any new language comes a certain amount of uncertainty as to what is “the right way” to do things.  Now I’m not just talking about recursion vs iteration or object-oriented vs straight procedural.  What I’m referring to is the design idioms which govern everything from naming conventions to array deconstruction.

This idioms are highly language specific and can even differ between languages with otherwise similar lexical elements.  Consider the following C++, Java, Ruby and Scala examples:

vector<string> first_names;
first_names.push_back("Daniel");
first_names.push_back("Chris");
first_names.push_back("Joseph");
 
for (vector<string>::iterator i = first_names.begin(); i != first_names.end(); ++i) {
    cout << *i << endl;
}

Java:

String[] firstNames = {"Daniel", "Chris", "Joseph"};
 
for (String name : firstNames) {
    System.out.println(name);
}

Ruby:

first_names = ['Daniel', 'Chris', 'Joseph']
 
first_names.each do |name|
  puts name
end

Scala:

val firstNames = Array("Daniel", "Chris", "Joseph")
 
for (name <- firstNames) {
  println(name)
}

As a matter of interest, the Scala example could also be shortened to the following:

val firstNames = Array("Daniel", "Chris", "Joseph")
firstNames.foreach(println)

All of these samples perform essentially the same task: traverse an array of strings and print each value to stdout.  Of course, the C++ example is actually using a vector rather than an array due to the evil nature of C/C++ arrays, but it comes to the same thing.  Passing over the differences in syntax between these four languages, what really stands out are the different ways in which the task is performed.  C++ and Java are both using iterators, while Ruby and Scala are making use of higher order functions.  Ruby and C++ both use lowercase variables separated by underscores, while Java and Scala share the camelCase convention.

This is a bit of a trivial example, but it does open the door to a much more interesting discussion: what are these idioms in Scala’s case?  Scala is a very new language which has yet to see truly wide-spread adoption.  More than that, Scala is fundamentally different from what has come before.  Certainly it draws inspiration from many languages - most notably Java, C# and Haskell - but even languages which are direct descendants can impose entirely different idioms.  Just look at the differences between the Java and the C++ examples above.  The practical, day-to-day implications of this become even more apparent when you consider object-oriented constructs:

// Book.h
class Book {
    std::string title;
    Author *author;
 
public:
    Book(string);
 
    std::string get_title();
    Author* get_author();
    void set_author(Author*);
};
 
// Book.cpp
Book::Book(string t) : title(t), author(0) {}
 
string Book::get_title() {
    return title;
}
 
Author* Book::get_author() {
    return author;
}
 
void Book::set_author(Author *a) {
    author = a;
}

The equivalent in Java:

public class Book {
    private String title;
    private Author author;
 
    public Book(String title) {
        this.title = title;
    }
 
    public String getTitle() {
        return title;
    }
 
    public Author getAuthor() {
        return author;
    }
 
    public void setAuthor(Author author) {
        this.author = author;
    }
}

And the (much shorter) Ruby code:

class Book
  attr_reader :title, :author
  attr_writer :author
 
  def initialize(title)
    @title = title
  end
end

This code uses standard, accepted idioms for class design in each of the three languages.  Notice how in C++ we avoid name-clashes between method formals and data members?  This is because the compiler tends to get a little confused if we try to combine name shadowing and the initialization syntax (the bit that follows the : in the constructor).  This contrasts strongly with Java which by convention encourages shadowing of fields by method formals.  It’s a very strong convention to do what I’ve done here, using this.fieldName to disambiguate between formals and fields.  Ruby stands apart from both of these languages by enforcing the use of prefix symbols on variable names to define the container.  There really can’t be any shadowing of instance variables by formals since all instance variables must be prefixed with the @ symbol.

The question here is: what conventions are applicable in Scala?  At first blush, it’s tempting to write a class which looks like this:

class Book(val title:String) {
  var author:Author = null
}

Because every variable/value in Scala is actually a method, the accessor/mutator pairs are already generated for us (similar to how it is in Ruby with attributes).  The problem here is of course redefining an accessor or mutator.  For example, we may want to perform a check in the author mutator to ensure that the new value is not null.  In C++ and Java, we would just add an if statement to the correct method and leave it at that.  Ruby is more interesting because of the auto-generated methods, but it’s still fairly easy to do:

class Book
  def author=(author)
    @author = author unless author.nil?
  end
end

Scala poses a different problem.  In our example above, we’ve essentially created a class with two public fields, something which is sternly frowned upon in most object-oriented languages.  If we had done this in Java, it would be impossible to implement the null check without changing the public interface of the class.  Fortunately Scala is more flexible.  Here’s a naive implementation of Book in Scala which performs the appropriate check:

class Book(val title:String) {
  private var author:Author = null
 
  def author = author
 
  def author_=(author:Author) {
    if (author != null) {
      this.author = author
    }
  }
}

Unfortunately, there are some fairly significant issues with the above code.  First off, it won’t compile.  Scala lacks separate namespaces for fields and methods, so you can’t just give a method the same name as a field and expect it to work.  This merging of namespaces actually allows a lot of interesting design and is on the whole a good thing, but it puts a serious crimp in our design.  The second problem with this example is closely related to the first.  Assuming the code did compile, at runtime execution would enter the author_= method, presumably pass the conditional and then execute the this.author = author statement.  However, the Scala compiler will interpret this line in the following way:

this.author_=(author)

That’s right, infinite recursion.  Usually this issue isn’t apparent because the compiler error will take precedence over the runtime design flaw, but it’s still something which merits notice.  Obviously, using the same names for variables as we do for methods and their formals just isn’t going to work here.  It may be the convention in Java, but we’ll have to devise something new for Scala.

Over the last few months, I’ve read a lot of Scala written by other people.  Design strategies to solve problems like these range all over the map, from totally separate names to prepending or appending characters, etc.  The community just doesn’t seem to be able to standardize on any one “right way”.  Personally, I favor the prepended underscore solution and would really like to see it become the convention, but that’s just me:

class Book(val title:String) {
  private var _author:Author = null   // notice the leading underscore
 
  def author = _author
 
  def author_=(author:Author) {
    if (author != null) {
      _author = author
    }
  }
}

The Scala community really needs to get together on this and other issues related to conventional design idioms.  I see a lot of code that’s written in Java, but with C or C++ idioms.  This sort of thing is rare nowadays, but was quite common in the early days of the language.  People weren’t sure what worked best in Java, so they tried to apply old techniques to the new syntax, often with disastrously unreadable results.  As Scala moves into the mainstream, we have to come to some sort of consensus as to what our code should look like and what conventions apply.  If we don’t, then the language will be forced to struggle through many of the same problems which plagued Java only a decade ago: new developers trying to get a handle on this foreign language, misapplying familiar constructs along the way.