Skip to content

Bencode Stream Parsing in Java

15
Jul
2008

It’s surprising how universal XML has become.  It doesn’t seem to matter what the problem, XML is the solution.  For example, consider a simple client/server architecture where the communication protocol must transmit some sort of structured data.  Nine developers out of ten will form the basis of the protocol around XML.  If it’s a lot of data to be transferred, then they will compress the XML using Java’s stream compression libraries.  If there’s binary data to be transmitted, it will either be stored as CDATA within the XML or as files within the same compressed archive.  Very few developers will actually stop and consider alternative solutions.

One such “alternative solution” is bencode (pronounced “bee-encode”).  Similar to formats like XML and JSON, bencode defines a series of constructs which may be used to encode arbitrarily complex data.  However, unlike XML, the design focus of the format was not to produce verbose, human-readable documents, but rather to encode data in the most concise manner possible.  To that end, the core bencode specification only includes four data types, two simple and two composite structures.  These types are defined with an almost complete absence of meta, requiring very little “structure” to clutter the data stream.

Unfortunately, outside of applications like BitTorrent, this elegant binary format has seen remarkably little adoption.  Because of this state of affairs, it can be extremely difficult to find libraries to actually process bencode data.  Not too long ago, I ran into a production use-case which required both parsing and generation of bencode-formatted files.  I considered digging into the source code for Vuze (nee “Azureus”), but a) it seemed like a lot of boring, nearly-wasted effort, and b) I strongly suspect that their bencode parser and generator are extremely space inefficient, since the data sources which they deal with are remarkably small.

The second hang-up was really a more significant motivator than the first, due to the fact that I knew I would be dealing with bencode streams potentially gigabytes in size.  So, rather than fruitlessly dig through someone else’s code, I decided to put all of this formal parser theory to work and roll my own library.  Unless you’re already familiar with bencode, I suggest you read the Wikipedia article to get a feel for the format, otherwise some of what I will be talking about will make no sense at all.  :-)

The first thing I needed to do was build the generation half of the library.  I decided that it would be easier if I avoided trying to use the same backend framework classes with both the generator and the parser.  For example, there are actually two classes in the framework which contain the logic for handling an integer: IntegerValue and IntegerType.  The former is for use in the parser, while the latter is for use in the generator.  This separation of logic may seem a little strange, but it actually simplifies things tremendously.

Remember my primary requirement: extremely efficient implementation of both generator and parser, especially with respect to space.  If I attempted to use the same classes to represent data for both the parser and the generator, then the parser would be forced to read the entire stream into some sort of in-memory representation (think about it; it’s actually true).  Obviously, this is unacceptable for streams that are gigabytes in size, so the traditional “good design” from an object-oriented standpoint was out.

Stream Generation

Since I needed the functionality of bencode stream generation before I needed parsing, I started with that aspect of the framework.  Here again, the most obvious “object-oriented” approach would have been the wrong one.  When we think of generating output in a structured format programmatically, we naturally imagine a DOM-like tree representation (preferably framework-agnostic) which is then walked by the framework to produce the output.  The major disadvantage to this approach is that it requires paging everything into memory.  This works for smaller applications or situations where the data is already in memory, but for my particular use-case, it would have been disastrous.

The only way to avoid paging everything into memory for stream generation is to structure the API so that the data is “pulled” by the generator, rather than “pushed” to it in tree-form.  In other words, the data itself has to be lazy-loaded, using callbacks to grab the data as-needed and hold it in memory only as long as is absolutely necessary.  In a functional language, this would be done with closures (or even normal data types in a pure-functional language).  However, as we all know, Java does not support such time-saving features.  The only recourse is to use abstract classes and interfaces which can be overridden in anonymous inner-classes as well as top-level classes as necessary.

image

After a bit of experimentation, the finalized hierarchy looks something like this.  Logically, every type must be able to query its abstract method for data of a certain Java type (long for IntegerType, InputStream for StringType, etc), convert this data into bencode with the appropriate meta, and then write the result to a given OutputStream.  Also following our nose, we see the semantic differences between composite and primitive types are really quite limited, especially if we simplify everything to a black box “get data / write encoding” methodology.  In fact, the only thing that CompositeType actually does is enforce the prefix/suffix encoding of every composite type.  Since this is in compliance with the bencode specification, we are safe in extracting this functionality into a superclass.

The more interesting distinction is between so-called “variant” and “invariant” types.  This is where you should begin to notice that I have over-engineered this library to some degree.  If I was just trying to create a pure bencode generator, then I could have skipped InvariantPrimitiveType and VariantPrimitiveType and just let IntegerType and StringType extend PrimitiveType directly.  This comes back to my initial requirements.

Priority one was to create a framework which was blazingly fast, but priority two was to ensure that it was extensible at the type level.  For the particular application I was interested in, I required more than just the core bencode types.  Also on the agenda were proper UTF-8 strings, dates, and support for null.  To accommodate all of this without too much code duplication, I knew I would have to extract a lot of the functionality into generic superclasses.  Hence my somewhat incorrect use of the terms “variant” and “invariant” to describe the difference between the integer type - which is prefix/suffix delimited - and the string type - which defines a length as its prefix and has no closing suffix.

Anyway, back to the problem at hand.  In addition to the CompositeType and PrimitiveType, you should also notice EntryType.  This “extra” type exists to handle the fact that bencode dictionaries are extremely weird and sit rather outside the “common functionality” umbrella of the format in general.  For one thing, the specification requires that dictionary entries be sorted by key, obviously implying some sort of Comparable relation.  Moreover, these keys must be themselves strings, but StringType isn’t comparable because its writeValue(OutputStream) method doesn’t return the data in question, but merely writes it to a given OutputStream.  Are we starting to see the problems with space-efficient implementations?

Enough babble though, let’s see some code!  Here’s how we might encode some very simple data using the generator framework:

public class GeneratorTest {
    public static void main(String[] args) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        final byte[] picture = new byte[0];        // presumably something interesting
 
        DictionaryType root = new DictionaryType() {
            @Override
            protected void populate(SortedSet<EntryType<?>> entries) {
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("name"), 
                        new LiteralStringType("Arthur Dent")));
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("number"), 
                        new IntegerType(42)));
 
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("picture"), 
                        new StringType() {
 
                    @Override
                    protected long getLength() {
                        return picture.length;
                    }
 
                    @Override
                    protected void writeValue(OutputStream os) throws IOException {
                        os.write(picture);
                    }
                }));
 
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("planets"), 
                        new ListType() {
 
                    @Override
                    protected void populate(ListTypeStream list) throws IOException {
                        list.add(new LiteralStringType("Earth"));
                        list.add(new LiteralStringType("Somewhere else"));
                        list.add(new LiteralStringType("Old Earth"));
                    }
                }));
            }
        };
 
        try {
            root.write(os);
        } catch (IOException e) {
            e.printStackTrace();
        }
 
        System.out.println(new String(os.toByteArray()));
    }
 
    private static class LiteralStringType extends StringType 
            implements Comparable<LiteralStringType> {
        private final String value;
 
        public LiteralStringType(String value) {
            this.value = value;
        }
 
        @Override
        protected long getLength() {
            return value.length();
        }
 
        @Override
        protected void writeValue(OutputStream os) throws IOException {
            os.write(value.getBytes("US-ASCII"));
        }
 
        public int compareTo(LiteralStringType o) {
            return o.value.compareTo(value);
        }
    }
}

It’s hard to imagine why some people claim that Java is a verbose language…

The API may seem a little clumsy, but most of that is caused by the conniptions required to make the generator lazily pull the data, rather than paging it all into memory ahead of time.  Throwing that aside, the rest of the verbosity seems to come from the need for LiteralStringType, rather than just having a StringType which could handle this for us.  The reason for this extra headache is shown in the population of the “picture” field, which presumably may contain several megabytes worth of data from some external source such as a file or database (in this case of course, it doesn’t contain anything, but that’s besides the point).

The result of the above is as follows:

d4:name11:Arthur Dent6:numberi42e7:picture0:7:planetsl5:Earth14:Somewhere else9:Old Earthee

Or, with a little formatting to make it more palatable:

d
  4:name
  11:Arthur Dent

  6:number
  i42e

  7:picture
  0:

  7:planets
  l
    5:Earth
    14:Somewhere else
    9:Old Earth
  e
e

Technically, this is no longer valid bencode, but it is much easier to read this way.

The Parser

With all this bustle surrounding the generator, it’s easy to forget about the inverse process: parsing.  As it turns out, this is both easier and far less elegant than the solution for the generator (I know, it’s a sad state of affairs when the above is considered “elegant”).  Here again, there was a need for the parser to be extremely efficient, especially in terms of memory.  Thus, the logical approach of simply parsing the stream into an in-memory tree doesn’t really work.  Instead, the parser must be a so-called “pull parser”, which only parses each token upon request.  The parser only does exactly what work you ask of it, nothing more.

My initial designs for the parser attempted to follow the example set by the generator: each value type self-contained, responsible for parsing its own format.  As it turns out, this can be difficult to accomplish.  I could have expanded slightly on the parser combinator concept, but monads are very clumsy to achieve in Java, which led me to rule out that option.  In the end, I took a middle ground.

Click for full size

As before, a common superinterface sits above the entire representative hierarchy.  To understand this hierarchy a little better, perhaps it would be helpful to look at the full source for Value:

public interface Value<T> {
    public T resolve() throws IOException;
    public boolean isResolved();
}

The resolve() method is really the core of the entire parser.  The concept is that each value will be able to consume the bytes necessary to determine its own value, which is converted and returned.  This is extremely convenient because it enables VariantValue(s) (such as string) to carry the logic for parsing to a specific length, rather than the conventional e terminator.  In order to avoid clogging up memory, the return value of resolve() should not be memoized (though, there is nothing in the framework to prevent it).  Conventionally, values which are already resolved should throw an exception if they are resolved a second time.  This prevents the framework from holding onto values which are no longer needed.

You will also notice that CompositeValue not only inherits from Value, but also from the JDK interface, Iterable.  Logically, a composite value is a linear collection of values, consumed one at a time.  To me, that sounds a lot like a unidirectional iterator.  We can, of course, resolve the entire composite at once, mindlessly consuming all of its values, but since all of the values are lost once consumed, the only purpose for such an action would be if we know that we don’t care about a particular composite and we just want to rapidly skip to the next value in the stream.

Returning to primitive values, the resolve() method for IntegerValue is worthy of note, not so much for its uniqueness, but because it is very similar to the parsing technique used in all the other values:

public Long resolve() throws IOException {
    if (resolved) {
        throw new IOException("Value already resolved");
    }
    resolved = true;
 
    boolean negative = false;
    long value = 0;
 
    int b = 0;
    while ((b = is.read()) >= 0) {
        int digit = b - '0';
 
        if (digit < 0 || digit > 9) {
            if (b == '-') {
                negative = true;
            } else if (b == 'e') {
                break;
            } else {
                throw new IOException("Unexpected character in integer value: " 
                    + Character.forDigit(b, 10));
            }
        } else {
            value = (value * 10) + digit;
        }
    }
 
    if (negative) {
        value *= -1;
    }
 
    return value;
}

The i prefix itself is consumed before control flow even enters this method.  This is because the prefix is required to determine the appropriate value implementation to use.  Specifically, the logic to perform this determination is contained within the Parser class, which maintains a map of Value(s) and their associated prefixes.  String values have special logic associated with them, as they do not have a prefix.

As with most hand-coded parsers, this one operates on the principle of “eat until it hurts”.  We start out by assuming that the integer value extends to the end of the stream, then we set about to find a premature end to the integer, at which point we break out and call it a day.  Since we are moving from left to right through a base-10 integer, we must multiply the current accumulator by 10 prior to adding the new digit. 

Actually, the real heart of the parser framework is CompositeValue.  This class is inherited by Parser to define a special value encompassing the stream itself (which is viewed as a composite value with no delimiters and only a single child).  This unification allows us to keep the code for parsing a composite stream in a single location.  This implementation is a little less concise than the code for parsing an integer, but it follows the same pattern and is fairly instructive:

protected final Value<?> parse() throws IOException {
    if (resolved) {
        throw new IOException("Composite value already resolved");
    }
 
    if (previous != null) {
        if (!previous.isResolved()) {
            previous.resolve();        // ensure we're at the right spot in the stream
        }
    }
 
    byte b = -1;
    if (readAhead instanceof Some) {
        b = readAhead.value();
        readAhead = new None<Byte>();
    } else {
        b = read();
    }
 
    if (b >= 0) {
        Class<? extends Value<?>> valueType = parser.getValueType(b);
 
        if (valueType != null) {
            return previous = Parser.createValue(valueType, parser, is);
        } else if (b > '0' && b <= '9') {
            return previous = readString(b - '0');
        } else if (b == ' ' || b == '\n' || b == '\r' || b == '\t') {
            return parse();        // loop state
        } else {
            throw new IOException("Unexpected character in the parse stream: " 
                + Character.forDigit(b, 10));
        }
    }
 
    throw new IOException("Unexpected end of stream in composite value");
}
 
private final StringValue readString(long length) throws IOException {
    int i = is.read();
 
    if (i >= 0) {
        byte b = (byte) i;
 
        if (b == ':') {
            return Parser.createValue(StringValue.class, parser, 
                new SubStream(is, length));
        } else if (b >= '0' && b <= '9') {
            return readString((length * 10) + b - '0');
        } else {
            throw new IOException("Unexpected character in string value: " 
                + Character.forDigit(i, 10));
        }
    }
 
    throw new IOException("Unexpected end of stream in string value");
}

It seems a bit imposing, but really this code is more of the same logic we saw previously when dealing with integers.  The only value type which really gives us trouble here is string.  We can’t simply treat it like the others because it has no prefix.  For this reason, we must assume that any unbound integer is an inclusive prefix for a string.  In most parser implementations, this would require backtracking, but because we are doing this by hand, we can condense the backtrack into an inherited parameter (borrowing terminology from attribute grammars), avoiding the performance hit.

There’s one final bit of weirdness which deserves attention before we bail on this small epic: dictionary values.  Intuitively, a dictionary value should be parsed into a Java Map, or some sort of associative data structure.  Unfortunately, a map is by definition a random access data structure.  Since we are dealing with a sequential bencode stream, the only recourse to satisfy this property would be to page the entire dictionary into memory.  This of course violates one of the primary requirements which is to avoid using more memory than necessary.

The solution I eventually chose to this problem was to limit dictionary access to sequential, which translates into alphabetical given the nature of bencode dictionaries.  Thus, a dictionary can be parsed in the same way as a list, where each element is a sequential key and value, jointly represented by EntryValue.  To make usage patterns slightly easier, EntryValue memoizes the key and value.  Due to the fact that both of these objects are themselves Value(s), this does not lead to inadvertent memory bloat.

Conclusion

Hopefully the parser and generator presented here will be of some utility in situations where you have to parse large volumes of bencoded data.  The API is (admittedly) bizarre and difficult to deal with, but the performance results are difficult to deny.  This framework is currently deployed in production, where benchmarks have shown that it imposes little-to-no runtime overhead, and practically zero memory overhead (despite the sizeable amounts of data being processed).

For convenience, I actually created a Google Code project for this framework so as to facilitate its development internally to the project I was working on.  The end result of this is unlike most of my experiments, there is actually a proper SVN from which the source may be obtained!  A packaged JAR may be obtained from the downloads section.

Groovy’s Performance is Not Subjective

31
Mar
2008

Ah, the saga of misinterpreted micro-benchmarks!  Developers advocating one technology over another have long used micro-benchmarks and trivial examples to illustrate their point.  This is most often seen in the area of language implementation, where performance is a critical (and often emotional) consideration.  The Computer Language Benchmarks Game is a good example of this.

With the rise of the JVM as a platform for language development, we’ve been seeing a lot of benchmarks directly comparing some of these languages with each other and with Java itself.  Actually, it seems that the popular trio (as far as benchmarking goes) are Scala, JRuby and Groovy.  It’s been a bit of a favorite past-time of the blogosphere to create benchmarks for these languages and then spin up controversial posts about the results.  Back in September, Darek Young wrote a post illustrating a ray tracer benchmark which really caught my eye:

Results

ray.java
12.89s

ray.scala
11.224s

ray.groovy
2h 31m 42s

I was expecting the Groovy code to run longer than the Scala code but was shocked at the actual difference. All three versions of the code produce identical images: (fullsize here)

Wow!  I’ve seen these results dozens of times (looking back at the post), but they never cease to startle me.  How could Groovy be that much slower than everything else?  Granted it is very much a dynamic language, compared to Java and Scala which are peers in static-land.  But still, this is a ray tracer we’re talking about!  There’s no meta-programming involved to muddle the scene, so a halfway-decent optimizer should be able to at least squeeze that gradient down to maybe 5x as slow, rather than a factor of 830.

If this were an isolated incident, I would probably just blow it off as bad benchmarking, or perhaps an odd corner case that trips badness in the Groovy runtime.  Then a week later, I read this post by Pete Knego:

Test Groovy Java Java vs Groovy (times faster)
Simple counter 8.450 150 56x
Binary tree building 19.500 2.580 7.6x
Binary tree traversing 2.530 76 33x
Prime numbers 43.270 1.170 37x

All [non-decimal] times are in milliseconds.

Well, this is really disappointing. I expected Groovy to be slower but not by that much. In order to understand where does such a performance hit come from we have to peek under the hood. The culprit of all this is of course Groovy’s dynamic nature, implemented as MOP. MOP is a way for Groovy to know which class elements (fields/properties, methods, interfaces, superclasses, etc..) are defined on an object and to have a way to alter that data or invoke it. The core of MOP are two methods defined on GroovyObject: get/setProperty() and invokeMethod(). This methods are called every time you access a field or call a method on a Groovy object and quite a lot of work is done behind the scenes. The internals of the MOP are listed in MetaClass interface and implemented in six different classes.

All of this is old news, so the question is: Why am I bringing this up now?  Well, I recently saw a post on Groovy Zone by none-other-than Rick Ross, talking about this very subject.  Rick’s post was in response to two posts (here and here), discussing ways to improve Groovy code performance by obfuscating code.  Final result?

This text is being written as I was changing and trying things, I gained 20s from
minor changes of which I lost track. :-) I am currently at 1m30s (down from the
original 4m and comparing with Java’s 4s).

I’m sorry, this is acceptable performance?  This is someone who’s spent time trying to optimize Groovy, and by his own admission, Groovy is 23x slower than the equivalent Java code.  Certainly this is a far cry from the 830x slower in the ray tracer benchmark, but in this case it’s simple string manipulation, rather than a mathematically intensive test.

Coming back to Rick’s entry, he looks at the conclusion and has this to say about it:

Language performance is highly overrated

Much is often made of the theoretical “performance” of a language based on benchmarks and arcane tests. There have even been cases where vendors have built cheats into their products specifically so they would score well on benchmarks. In the end, runtime execution speed is not as important a factor as a lot of people would think it is if they only read about performance comparisons. Other factors such as maintainability, interoperability, developer productivity and tool and library support are all very significant, too.

Wait a minute, that sounds a lot like something else I’ve read recently!  Maybe something like this:

Is picking out the few performance weaknesses the right way to judge the
overall speed of Groovy?

To me the Groovy performance is absolutely sufficient because of the
easy integration with Java. If something’s too slow, I do it in Java.
And Java compared to Python is in most cases much faster.

I appreciate the efforts of the Groovy team to improve the performance,
but if they wouldn’t, this would be no real problem to me. Groovy is the
grooviest language with a development team always having the simplicity
and elegance of the language usage in mind - and that counts to me.  :-)

This is almost a mantra for the Groovy proponents: performance is irrelevant.  What’s worse, is that the few times where they’ve been pinned down on a particular performance issue that’s obviously a problem, the response seems to be along the lines of: this test doesn’t really show anything, since micro-benchmarks are useless.

I’m sorry, but that’s a cop-out.  Face up to it, Groovy’s performance is terrible.  Anyone who claims otherwise is simply not looking at the evidence.  Oh, and if you’re going to claim that this is just a function of shoe-horning a dynamic language onto the JVM, check out a direct comparison between JRuby and and Groovy.  Groovy comes out ahead in only four of the tests.

What really bothers me about the Groovy performance debates is that most “Groovyists” seem to believe that performance is in the eye of the beholder.  The thought is that it’s all just a subjective issue and so should be discounted almost completely from the language selection process.  People who say this have obviously forgotten what it means to try to write a scalable non-trivial application which performs decently under load.  When you start getting hundreds of thousands of hits an hour, you’ll be willing to sell your soul for every last millisecond.

The fact is that this is simply not the case.  Performance is important and it must be considered when choosing a language for your next project.  Now I’m not implying that it should be the sole consideration, after all, we don’t use Assembly much anymore.  But we can’t just discard performance altogether as a means of evaluation.  Developer productivity is important, but it means nothing if the end result doesn’t meet basic responsiveness standards.

It’s like the first time I tried Mingle (which is written using JRuby BTW) back when it was still in pre-beta.  The application was amazing, but it took literally minutes to render the most basic of pages.  The problem was closely related to the state of the JRuby interpreter at the time and its poor memory management with respect to Rails.  In the end, the application was completely unusable.  ThoughtWorks had produced an amazing piece of software in a remarkably short time span, but the cost was the use of a language and framework which (at the time) led to unredeemable performance.  They spared the developers at the expense of the end-users.

Both Mingle and JRuby have come a long way since those first tests.  Charles and the gang have put an intense amount of effort into optimizing the JRuby runtime and JIT compiler.  They’ve gone from an implementation which was 5-10x slower than MRI 1.8.6 to a fully compatible implementation that is usually faster than the original.  Obviously performance is achievable in a dynamic language on the JVM, so why is Groovy still so horrible?

The only answer I can think of is that the Groovy core team just doesn’t value performance.  Why else would they consistently bury their heads in the sand, ignoring the issues even when the evidence is right in front of them?  It’s as if they have repeated their own “performance is irrelevant” mantra so many times that they are actually starting to believe it.  It’s unfortunate, because Groovy really is an interesting effort.  I may not see any value for my needs, but I can understand how a lot of people would.  It fills a nice syntactic niche that other languages (such as Ruby) just miss.  But all of its benefits are for naught if it can’t deliver when it counts.

Defining High, Mid and Low-Level Languages

27
Feb
2008

I’ve been writing quite a bit recently about the differences between languages.  Mostly I’ve just been whining about how annoying it is that everyone keeps searching for the “one language to rule them all”, the Aryan Language if you will.  Over the course of some of these articles, I’ve made some rather loosely defined references to terms like “general purpose” and “mid-level” when trying to describe these languages. 

Several people have (rightly) called me out on these terms, arguing that I haven’t really defined what they mean, so I shouldn’t be using them to try to argue a certain point.  In the case of “general purpose language”, I have to admit that I tend to horribly misuse the term and any instances within my writing should be discarded without thought.  However, I think with a little bit of reflection, we can come to some reasonable definitions for high-, mid- and low-level languages.  To that end, I present the “Language Spectrum of Science!” (cue reverb)

Language Spectrum of Science 

This scale is admittedly arbitrary and rather loosely defined in and of itself, but I think it should be a sufficient visual aid in conveying my point.  In case you hadn’t guessed, red languages are low-level, green languages are high-level and that narrow strip of yellow represents the mid-level languages.  Obviously I’m leaving out a large number of languages which could be represented with equal validity, but I only have a finite number of pixels in page-width.

The scale is also somewhat myopic.  It defines Ruby as the highest of the high-level languages.  Very few could argue the other side of the scale since there’s not really anything lower than the hardware, but claiming that Ruby is the most high-level language in history seems somewhat odd.  In truth, I picked Ruby as the super high-level language mainly because it’s a) more dynamic than both JavaScript and Perl, b) more prone to RAD frameworks like Rails and c) it’s the most significant high-level language which I’m really familiar with.

It’s also important to note that languages aren’t really points on the spectrum, but rather they span ranges which are more or less wide, depending on the capabilities.  These ranges may overlap considerably (as in the case of Java and Scala) or may be entirely disjoint (Assembly and Ruby).  In short, the scale is somewhat blurry and shouldn’t be taken as a canonical reference.

Low-Level

Of all of the categories, it’s probably easiest to define what it means to be a low-level language.  Machine code is low level because it runs directly on the processor.  Low-level languages are appropriate for writing operating systems or firmware for micro-controllers.  They can do just about anything with a little bit of work, but obviously you wouldn’t want to write the next major web framework in one of them (I can see it now, “Assembly on Rails”).

Characteristics

  • Direct memory management
  • Little-to-no abstraction from the hardware
  • Register access
  • Statements usually have an obvious correspondence with clock cycles
  • Superb performance

C is actually a very interesting language in this category (more so C++) because of how broad its range happens to be.  C allows you direct access to registers and memory locations, but it also has a number of constructs which allow significant abstraction from the hardware itself.  Really, C and C++ probably represent the most broad spectrum languages in existence, which makes them quite interesting from a theoretical standpoint.  In practice, both C and C++ are too low-level to do anything “enterprisy”.

Mid-Level

This is where things start getting vague.  Most high-level languages are well defined, as are low-level languages, but mid-level languages tend to be a bit difficult to box.  I really define the category by the size of application I would be willing to write using a given language.  I would have no problem writing and maintaining a large desktop application in a mid-level language (such as Java), whereas to do so in a low-level language (like Assembly) would lead to unending pain.

This is really the level at which virtual machines start to become common-place.  Java, Scala, C# etc all use a virtual machine to provide an execution environment.  Thus, many mid-level languages don’t compile directly down to the metal (at least, not right away) but represent a blurring between interpreted and compiled languages.  Mid-level languages are almost always defined in terms of low-level languages (e.g. the Java compiler is bootstrapped from C).

Characteristics

  • High level abstractions such as objects (or functionals)
  • Static typing
  • Extremely commonplace (mid-level languages are by far the most widely used)
  • Virtual machines
  • Garbage collection
  • Easy to reason about program flow

High-Level

High-level languages are really interesting if you think about it.  They are essentially mid-level languages which just take the concepts of abstraction and high-level constructs to the extreme.  For example, Java is mostly object-oriented, but it still relies on primitives which are represented directly in memory.  Ruby on the other hand is completely object-oriented.  It has no primitives (outside of the runtime implementation) and everything can be treated as an object.

In short, high-level languages are the logical semantic evolution of mid-level languages.  It makes a lot of sense when you consider the philosophy of simplification and increase of abstraction.  After all, people were n times more productive switching from C to Java with all of its abstractions.  If that really was the case, then can’t we just add more and more layers of abstraction to increase productivity exponentially?

High-level languages tend to be extremely dynamic.  Runtime flow is changed on the fly through the use of things like dynamic typing, open classes, etc.  This sort of technique provides a tremendous amount of flexibility in algorithm design.  However, this sort of mucking about with execution also tends to make the programs harder to reason about.  It can be very difficult to follow the flow of an algorithm written in Ruby.  This “obfuscation of flow” is precisely why I don’t think high-level languages like Ruby are suitable for large applications.  That’s just my opinion though.  :-)

Characteristics

  • Interpreted
  • Dynamic constructs (open classes, message-style methods, etc)
  • Poor performance
  • Concise code
  • Flexible syntax (good for internal DSLs)
  • Hybrid paradigm (object-oriented and functional)
  • Fanatic community

Oddly enough, high-level language developers seem to be much more passionate about their favorite language than low- or mid-level developers.  I’m not entirely sure why it has to be this way, but the trend has been far too universal to ignore (Python, Perl, Ruby, etc).  Ruby is of course the canonical example of this primarily because of the sky-rocket popularity of Rails, but any high-level language has its fanatic evangelists.

What’s really interesting about many high-level languages is the tendency to fall into a hybrid paradigm category.  Python for example is extremely object-oriented, but also allows things like closures and first-class functions.  It’s not as powerful in this respect as a language like Scala (which allows methods within methods within methods), but nevertheless it is capable of representing most elements of a pure-functional language.

As an aside, high-level languages usually perform poorly compared with low- or even mid-level languages.  This is merely a function of the many layers of abstraction between the code and the machine itself.  One instruction in Ruby may translate into literally thousands of machine words.  Of course, high-level languages are almost exclusively used in situations where such “raw-metal” performance is unnecessary, but it’s still a language trait worth remembering.

Conclusion

It’s important to remember that I’m absolutely not recommending one language or “level” over another for the general case.  The very reason we have such a gradient variety of language designs is that there is a need for all of them at some point.  The Linux kernel could never be written in Ruby, and I would never want to write an incremental backup system in Assembly.  All of these languages have their uses, it’s just a matter of identifying which language matches your current problem most closely.

ActiveObjects 0.7 Released

22
Dec
2007

Just in time for the Christmas holidays, I give you ActiveObjects version 0.7!  This release sports a whole slew of minor performance tweaks and bugfixes, providing better reliability.  Also, we’ve greatly increased the number and scope of the unit tests running against the core.  This gives us more confidence that the code is indeed as stable as we hope.

Actually, the big ticket item for 0.7 is the introduction of polymorphic relations.  This feature is fully tested and stable, so I don’t anticipate it introducing any issues in the coming days as we narrow our focus to 1.0.  However, no feature is so well tested as to be certainly bug free under all circumstances.  I look forward to fixing any problems you may encounter with the library!

A few minor feature additions:

  • Support for persisting enum values
  • One-to-one relations
  • EntityManager#flush(…) now also flushes the relations cache

Looking toward the road ahead, things are really starting to settle down a bit.  The main feature I’m looking to implement for 0.8 is extensible value caching, which should allow you to determine how and where the cached field values are stored.  The main use-case behind this is support for memcached, which has been requested a few times and has actually been on the agenda since 0.3.  With memcached driving the value cache, ActiveObjects should be almost perfectly scalable.

To that end, it’s also worth mentioning that I did some profiling (using the NetBeans Profiler) using a Wicket-based web application I’m developing on the side.  I’m happy to report that the overhead imposed by ActiveObjects is extremely minimal.  In each test, the bottleneck was in the PreparedStatement#execute method (in JDBC).  Since ActiveObjects generates very streamlined and natural queries, it seems that this is really the minimal execution time possible for such database access.  Extrapolating further from this data, and given that the JDBC exec time was avg 40ms, while the execution time in the ActiveObjects API was on the order of 1ms-5ms, we can safely say that ActiveObjects is less than 10% slower than using JDBC directly (compare that to the 50% overhead claimed by ActiveRecord).  This “estimate” doesn’t even take into account the many performance advantages offered by ORMs over hand-coded database access (such as relations caching, optimal query construction, etc).

Anyway, enough shameless boasting.  As Hibernate has often pointed out, ORM benchmarks are usually worse than useless.  The only benchmark which really matters is how well the ORM performs in your application.  Try it out, see what you think.

Wide World of Pool Providers: Side-by-Side Comparison

13
Nov
2007

It seems that for any conceivable functionality in Java, there exist a myriad of frameworks which accomplish the task in more-or-less the same way.  ORMs for example; I can count five different Java ORMs without even trying, and I’m sure that number would expand exponentially if I actually sat down and used Google to get a more precise estimate. 

Just like any other function, there seems to be a glut of frameworks which provide JDBC connection pooling.  Choosing between these frameworks can sometimes be a daunting task.  After all, what qualifications do you look at?  Performance?  Licensing?  Documentation?  Connection pooling is such an (apparently) small part of an application’s infrastructure that many development teams devote very little time to this vital selection process.

Due to this management shortsightedness, many projects will simply use whatever pooling library is default for the ORM their using, or (more likely) the first Google hit when searching for “java connection pool”.  Of course, this will get you something which is usable, but rarely will you arrive at a pool provider which is optimal.

To help you to make a more informed decision in this vital aspect of project design, I hereby present some of the lessons I have learned on the subject while working on ActiveObjects.  All benchmarks were run against MySQL 5.0 running on Windows Vista Premium, 2 Ghz Intel Core2Duo, 2 GB DDR2, 7200 RPM SATA drive.  Each test was run five times (executing the DDL each time) with the average runtime taken as the result.  Any obviously poor results (several seconds above the mean) were dropped and re-tested.  All tests were run using the Eclipse JUnit 4 test runner.

commons-dbcp

This is quite possibly the most commonly used pool provider (by default backed by commons-pool), mainly because it used to come up first on Google.  Though, perhaps more important is the fact that commons-dbcp is an Apache sub-project.  This gives it both a less-restrictive license than some other projects, as well as a credibility that comes with being hosted at Apache.  Honestly, if I see a project’s URL contains “apache.org”, I immediately give it the benefit of the doubt, assuming that it will be of reasonable to high quality.  There’s certainly something to be said for reputation…

commons-dbcp is probably the easiest pool provider I’ve seen in terms of API and “just work”-ness.  It has two mechanisms for setting up connections: alternative JDBC URI and a JNDI DataSource implementation.  It’s interesting to note here that the JDBC javadocs state that DataSource is the preferred way to retrieve connections, while the commons-pool javadoc asserts that the alternative URI method passed directly to DriverManager is preferable.  In practice, I find myself using the DataSource method for connection pools, mainly because it reminds me that I’m not dealing with a normal connection creation, but something which is potentially pooled:

BasicDataSource ds = new BasicDataSource();
ds.setDriverClassName(jdbcDriver);
 
ds.setUsername(getUsername());
ds.setPassword(getPassword());
ds.setUrl(getURI());
 
ds.setMaxActive(20);
 
// get connection here
Connection conn = ds.getConnection();
 
// dispose of pool
ds.close();

It’s important to note here that the pool is explicitly disposed. This is always a good idea, even for pools which don’t state the requirement in their documentation.  An undisposed pool can hold database connection open, tying up resources and dragging your database performance through the dirt.  Always, always, always dispose of your connection pools when you’re done with them.

So the API seems pretty intuitive here.  All of the methods do exactly what one would expect.  What’s more, the entire library is extremely well documented.  There’s quite a bit of material on the commons-pool project page discussing how to get started, what best practices to follow, etc.  The public API is javadoc’d, and there are a number of examples available.  I was up-and-running with the framework a few short minutes after I punched in the URL to my address bar.

One thing I haven’t addressed yet is performance.  It’s vitally important that the connection pool chosen run as efficiently as possible.  After all, its whole purpose is to optimize access and reduce the strain on the database in the form of connection create as well as statement compilation.  Obviously all of the really interesting stuff is in this segment of the library.  If this code performs poorly, it would be a very bad idea to try and use the framework for any sort of serious project.

I just happen to have a reasonably comprehensive database benchmark handy in the form of the ActiveObjects JUnit test suite.  ActiveObjects uses a reasonable number of JDBC features (it doesn’t use many conventional Statement(s) or stored procedures).  Since neither the suite nor the library itself changes between benchmarks, we can test arbitrary connection pools easily and receive reasonably accurate results.

Continuing my recent obsession with HTML tables and their use in product reviews, here is the obligatory “five second rundown”:

Documentation Excellent
API Easy and intuitive
License Apache License 2.0
AO Test Suite Run Time 20.6302 seconds

C3P0

C3P0 is another very common connection pool framework, partially because it’s the default pool used with the ever-popular Hibernate ORM.  Unlike commons-pool, C3P0 is actually hosted at SourceForge, that ever popular source of dead open-source projects and over-ambitious specs.  With that said, C3P0 is actually quite respectable as a framework and seems to have avoided the premature fate which befalls most open-source frameworks: developer boredom.

Unfortunately, like so many projects on SourceForge, C3P0 does not have a separate website.  The maintainers opted to stick with the SourceForge project interface as the sole source of “official” information.  Add to that the fact that they decided not to add anything to the “Documentation” section of the page and you arrive at one very frustrating first impression for a new user.  Fortunately there’s a lot of material on using C3P0 (both with and without Hibernate) available around the internet.  Always remember: Google is your friend.

ComboPooledDataSource cpds = new ComboPooledDataSource();
cpds.setDriverClass(jdbcDriver);
 
cpds.setJdbcUrl(getURI());
cpds.setUser(getUsername());
cpds.setPassword(getPassword());
 
cpds.setMaxPoolSize(20);
cpds.setMaxStatements(180);
 
// get connection here
Connection conn = cpds.getConnection();
 
// dispose of pool
try {
    DataSources.destroy(cpds);
} catch (SQLException e) {
}

The API is somewhat similar to that of commons-dbcp.  Both use the DataSource API as a foundation (which is the “right” approach according to the JDBC docs), and both allow roughly the same configuration options on a pool.  At face value, the APIs seem similar to the point that comparison between the two on such a level would be pointless.

One very important “feature” of the C3P0 library is its license: LGPL.  For those of you who don’t know, LGPL is basically identical to the famed GPL 2.0 without the so-called “viral clause”.  GPL pre-3.0 has some legal ambiguity relating to “derivative works” and what qualifies as such.  For this reason, many projects (especially commercial applications written using object-oriented languages) tend to shy away from libraries licensed as such.  LGPL of course doesn’t have this problem, so it has seen moderately better acceptance from the corporate gods.  Unfortunately, it is still a fairly restrictive license relating to other matters such as redistribution.  It is fairly common practice to perform what can be described as “static linking” of JAR files for an application (un-JARring dependencies and then re-JARring them into the main application JAR).  This is something which is prohibited under LGPL, thus restricting deployment options somewhat.  Also, if I remember correctly any non-LGPL application or framework using a LGPL licensed dependency must include a copy of LGPL somewhere in the application (the About or Help section springs to mind).  It was due to this licensing that a company I recently worked for decided against using C3P0 for its application.  Of course, everyone’s requirements are different, but you should still be aware of the possible consequences of using a restrictively licensed framework.

Documentation Lousy
API Easy and intuitive
License LGPL
AO Test Suite Run Time 17.277 seconds

Proxool

Like C3P0, Proxool is an open-source pooling library hosted on SourceForge.  Thankfully, unlike C3P0, Proxool’s maintainers actually took the time to build a full site for the project, containing documentation and examples.  Unfortunately for us, the examples don’t do much good.

Proxool’s documentation is obfuscated and hidden away, making it somewhat difficult to get started with the framework.  Unintuitively enough, the “Quick start” section is of very little help when trying to actually use the library.  Oh it does contain samples, but in my tests I couldn’t get the samples to run successfully.  Add to this the fact that Proxool is a less well-known framework and you lead to some very frustrating experiences trying to get up and running.

To their credit, the Proxool maintainers have written quite a bit of documentation which covers a great deal of the framework functionality.  Organizing a project page intuitively is very hard, it’s just a shame that would-be adopters of the framework have to pay the penalty. 

So to make things easier for others like myself looking to try the framework, here’s the basic setup code for a Proxool pool:

Class.forName(jdbcDriver);
 
Properties props = new Properties();
props.setProperty("proxool.maximum-connection-count", "20");
props.setProperty("user", getUsername());
props.setProperty("password", getPassword());
 
String driverUrl = getURI();
String url = "proxool.mypool:" + jdbcDriver + ":" + getURI();
 
ProxoolFacade.registerConnectionPool(url, props);
 
// get connection here
Connection conn = DriverManager.getConnection("proxool.mypool");
 
// dispose of pool
try {
    ProxoolFacade.removeConnectionPool("mypool");
} catch (ProxoolException e) {
}

Hardly intuitive I’d say.  Nevertheless, the above code seems to get the job done.

One of the debatable advantages to the Proxool library is that it allows developers to take advantage of connection pooling simply by using a special JDBC URI prefix (commons-dbcp allows this too).  I’m not using that syntax in the above example mainly because I think that developers are better served remembering when they are or are not using a pool.  Also, I never could quite get the syntax working (again, poorly structured documentation).

One interesting feature of Proxool that’s worth mentioning is that it allows developers access to things like pool stats, event listeners and so on.  I believe these features are completely unique to Proxool, and while they’re not very interesting in a small test application, imagine the power which can be unleashed in a real-world application.  Exposing this information through something like JMX could make tracing and debugging of database bottlenecks on a production server significantly easier.

Documentation Frustrating
API Poor
License Apache License
AO Test Suite Run Time 18.6406 seconds

Benchmark Comparison

In terms of raw performance, C3P0 comes out ahead by almost a second and a half.  For a short-running test suite like that of ActiveObjects, that’s a fairly impressive difference.  That translates into hours of clock time saved on a database-intensive application of the course of a few weeks.  In my book, that’s something seriously worth considering.

Proxool came in a solid second place, at eighteen and a half seconds.  It’s definitely slower than C3P0, but it’s a full two seconds faster than commons-dbcp.  Considering that Proxool is licensed under the far less restrictive Apache License, it may be worth sacrificing the odd millisecond per query, depending on the opinion of your legal department.

commons-dbcp was the slowest of the three benchmarked at a disappointing twenty and one half seconds.  I’m not entirely sure why DBCP is so much slower in its default, commons-pool backed implementation.  However, the fact remains that performance-wise, it isn’t even worth comparing with C3P0.  Seems I need to make some changes in the classpath of some of my projects…

Throughout the whole benchmarking process, I was constantly reminding why Vista is so notoriously difficult as a host OS for application benchmarks.  The results were constantly fluctuating dramatically up and down, based on how much Vista had superloaded, indexing state, open apps, etc.  In short, Vista was so frustratingly difficult to deal with in the testing process that the test results should be treated with some skepticism.  After all, it’s hard to say that this is empirical, hard evidence when I’m throwing away three quarters of the test results due to vast deviation from the mean.

Conclusion

I must (grudgingly) admit that C3P0 is probably the best choice for most projects.  I say grudgingly because the extreme lack of documentation really bothers me.  Granted, Proxool, the next closest in performance, only has an advantage in licensing; its documentation is no better than C3P0’s.  Proxool of course has the added disadvantage of having a difficult API, as well as less popularity, therefore fewer articles and samples available around the web.

So if you’re a license purist, and you want an intuitive API at the expense of performance, commons-dbcp is the way to go.  However, if you’re willing to work within the restrictions of the LGPL license and you know how to use Google effectively, C3P0 would be the preferred choice, given its higher performance and excellent configurablility.

Update: I didn’t have time to run the benchmarks in any sort of rigorous way (see aforementioned whining about Vista’s benchmark flakiness), but preliminary runtimes indicate that DBPool is an even better framework, performance-wise.  It has a less restrictive license than C3P0, and seems to have a second to a second and a half edge in runtime. Again, these are just quick numbers I grabbed as I was adding support for the provider to ActiveObjects, but I thought it was worth mentioning.