- Code Commit - http://www.codecommit.com/blog -

Bencode Stream Parsing in Java

Posted By Daniel Spiewak On July 15, 2008 @ 12:00 am In Java

It’s surprising how universal XML has become.  It doesn’t seem to matter what the problem, XML is the solution.  For example, consider a simple client/server architecture where the communication protocol must transmit some sort of structured data.  Nine developers out of ten will form the basis of the protocol around XML.  If it’s a lot of data to be transferred, then they will compress the XML using Java’s stream compression libraries.  If there’s binary data to be transmitted, it will either be stored as CDATA within the XML or as files within the same compressed archive.  Very few developers will actually stop and consider alternative solutions.

One such “alternative solution” is bencode [1] (pronounced “bee-encode”).  Similar to formats like XML and JSON, bencode defines a series of constructs which may be used to encode arbitrarily complex data.  However, unlike XML, the design focus of the format was not to produce verbose, human-readable documents, but rather to encode data in the most concise manner possible.  To that end, the core bencode specification only includes four data types: two simple (integers and byte strings) and two composite (lists and dictionaries).  These types are defined with an almost complete absence of metadata, requiring very little “structure” to clutter the data stream.
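To make the four types concrete, here is a minimal sketch of an encoder for values that are already in memory.  It is not the library discussed in this post (which deliberately avoids holding data in memory); the class and method names are made up for illustration, and keys are assumed to be plain ASCII so that Java's String ordering matches byte ordering.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BencodeTypesSketch {
    // Illustration only: encodes an in-memory Long, String, List or Map.
    public static void encode(Object value, ByteArrayOutputStream out) {
        if (value instanceof Long) {
            writeAscii(out, "i" + value + "e");              // integer: i<digits>e
        } else if (value instanceof String) {
            byte[] bytes = ((String) value).getBytes(StandardCharsets.UTF_8);
            writeAscii(out, bytes.length + ":");             // string: <byte count>:<bytes>
            out.write(bytes, 0, bytes.length);
        } else if (value instanceof List) {
            out.write('l');                                  // list: l<items>e
            for (Object item : (List<?>) value) {
                encode(item, out);
            }
            out.write('e');
        } else if (value instanceof Map) {
            // dictionary: d<key><value>...e with keys in sorted order
            TreeMap<String, Object> sorted = new TreeMap<String, Object>();
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                sorted.put((String) entry.getKey(), entry.getValue());
            }
            out.write('d');
            for (Map.Entry<String, Object> entry : sorted.entrySet()) {
                encode(entry.getKey(), out);
                encode(entry.getValue(), out);
            }
            out.write('e');
        } else {
            throw new IllegalArgumentException("Not a bencode type: " + value);
        }
    }

    private static void writeAscii(ByteArrayOutputStream out, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        out.write(bytes, 0, bytes.length);
    }

    public static String encodeToString(Object value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        encode(value, out);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, Object> person = new HashMap<String, Object>();
        person.put("number", 42L);
        person.put("name", "Arthur Dent");
        person.put("planets", Arrays.asList("Earth", "Old Earth"));
        System.out.println(encodeToString(person));
    }
}
```

Note how little structure surrounds the data: one prefix byte and one terminator for integers, lists and dictionaries, and a length plus a colon for strings.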

Unfortunately, outside of applications like BitTorrent, this elegant binary format has seen remarkably little adoption.  Because of this state of affairs, it can be extremely difficult to find libraries to actually process bencode data.  Not too long ago, I ran into a production use-case which required both parsing and generation of bencode-formatted files.  I considered digging into the source code for Vuze [2] (nee “Azureus”), but a) it seemed like a lot of boring, nearly-wasted effort, and b) I strongly suspect that their bencode parser and generator are extremely space inefficient [3], since the data sources which they deal with are remarkably small.

The second concern was a more significant motivator than the first, because I knew I would be dealing with bencode streams potentially gigabytes in size.  So, rather than fruitlessly dig through someone else’s code, I decided to put all of this formal parser theory [4] to work and roll my own library.  Unless you’re already familiar with bencode, I suggest you read the Wikipedia article [1] to get a feel for the format; otherwise, some of what I will be talking about will make no sense at all.  :-)

The first thing I needed to do was build the generation half of the library.  I decided that it would be easier if I avoided trying to use the same backend framework classes with both the generator and the parser.  For example, there are actually two classes in the framework which contain the logic for handling an integer: IntegerValue and IntegerType.  The former is for use in the parser, while the latter is for use in the generator.  This separation of logic may seem a little strange, but it actually simplifies things tremendously.

Remember my primary requirement: extremely efficient implementation of both generator and parser, especially with respect to space.  If I attempted to use the same classes to represent data for both the parser and the generator, then the parser would be forced to read the entire stream into some sort of in-memory representation (think about it; it’s actually true).  Obviously, this is unacceptable for streams that are gigabytes in size, so the traditional “good design” from an object-oriented standpoint was out.

Stream Generation

Since I needed the functionality of bencode stream generation before I needed parsing, I started with that aspect of the framework.  Here again, the most obvious “object-oriented” approach would have been the wrong one.  When we think of generating output in a structured format programmatically, we naturally imagine a DOM-like tree representation (preferably framework-agnostic) which is then walked by the framework to produce the output.  The major disadvantage to this approach is that it requires paging everything into memory.  This works for smaller applications or situations where the data is already in memory, but for my particular use-case, it would have been disastrous.

The only way to avoid paging everything into memory for stream generation is to structure the API so that the data is “pulled” by the generator, rather than “pushed” to it in tree-form.  In other words, the data itself has to be lazy-loaded, using callbacks to grab the data as-needed and hold it in memory only as long as is absolutely necessary.  In a functional language, this would be done with closures (or even normal data types in a pure-functional language).  However, as we all know, Java does not support such time-saving features.  The only recourse is to use abstract classes and interfaces which can be overridden in anonymous inner-classes as well as top-level classes as necessary.
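As a rough sketch of what “pulled by the generator” means in practice, consider a string type whose payload is only touched at the moment it is written.  The names here (LazyStringSketch, LazyString, copy) are hypothetical and only loosely modeled on the shape of the API described in this post; the byte array stands in for a file or database stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class LazyStringSketch {
    /** Hypothetical generator-side string: the framework pulls the data when it writes. */
    public abstract static class LazyString {
        protected abstract long getLength();
        protected abstract void writeValue(OutputStream os) throws IOException;

        public final void write(OutputStream os) throws IOException {
            os.write((getLength() + ":").getBytes(StandardCharsets.US_ASCII));
            writeValue(os);   // callback: the payload is only touched at this point
        }
    }

    /** Copies a source stream through a small fixed buffer, so memory use stays constant. */
    static void copy(InputStream in, OutputStream os) throws IOException {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            os.write(buffer, 0, n);
        }
    }

    public static String demo(final byte[] payload) {
        LazyString picture = new LazyString() {
            @Override
            protected long getLength() {
                return payload.length;   // a real source would report a file or BLOB size
            }

            @Override
            protected void writeValue(OutputStream os) throws IOException {
                // The byte array stands in for a file or database stream.
                try (InputStream in = new ByteArrayInputStream(payload)) {
                    copy(in, os);
                }
            }
        };
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            picture.write(out);
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("hello world".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

The anonymous inner class is doing the job a closure would do elsewhere: it captures where the data lives, and nothing is read until write is called.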

[Image: class hierarchy of the generator types]

After a bit of experimentation, the finalized hierarchy looks something like this.  Logically, every type must be able to query its abstract method for data of a certain Java type (long for IntegerType, InputStream for StringType, etc), convert this data into bencode with the appropriate meta, and then write the result to a given OutputStream.  Also following our nose, we see the semantic differences between composite and primitive types are really quite limited, especially if we simplify everything to a black box “get data / write encoding” methodology.  In fact, the only thing that CompositeType actually does is enforce the prefix/suffix encoding of every composite type.  Since this is in compliance with the bencode specification, we are safe in extracting this functionality into a superclass.
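The prefix/suffix enforcement can be sketched as a small template method.  This is an assumption about the general shape rather than the actual CompositeType source; the names Type, Composite and writeContents are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class CompositeSketch {
    /** Anything that can write its own bencode form. */
    public interface Type {
        void write(OutputStream os) throws IOException;
    }

    /** Hypothetical superclass: owns the prefix byte and the closing 'e'. */
    public abstract static class Composite implements Type {
        private final char prefix;

        protected Composite(char prefix) {
            this.prefix = prefix;
        }

        protected abstract void writeContents(OutputStream os) throws IOException;

        @Override
        public final void write(OutputStream os) throws IOException {
            os.write(prefix);        // 'l' for lists, 'd' for dictionaries
            writeContents(os);       // subclasses only worry about their children
            os.write('e');
        }
    }

    public static Type integer(final long value) {
        return new Type() {
            @Override
            public void write(OutputStream os) throws IOException {
                os.write(("i" + value + "e").getBytes(StandardCharsets.US_ASCII));
            }
        };
    }

    public static String demo() {
        Composite list = new Composite('l') {
            @Override
            protected void writeContents(OutputStream os) throws IOException {
                for (long n = 1; n <= 3; n++) {
                    integer(n).write(os);   // each child is created and written on demand
                }
            }
        };
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            list.write(out);
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```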

The more interesting distinction is between so-called “variant” and “invariant” types.  This is where you should begin to notice that I have over-engineered this library to some degree.  If I was just trying to create a pure bencode generator, then I could have skipped InvariantPrimitiveType and VariantPrimitiveType and just let IntegerType and StringType extend PrimitiveType directly.  This comes back to my initial requirements.

Priority one was to create a framework which was blazingly fast, but priority two was to ensure that it was extensible at the type level.  For the particular application I was interested in, I required more than just the core bencode types.  Also on the agenda were proper UTF-8 strings, dates, and support for null.  To accommodate all of this without too much code duplication, I knew I would have to extract a lot of the functionality into generic superclasses.  Hence my somewhat incorrect use of the terms “variant” and “invariant” to describe the difference between the integer type – which is prefix/suffix delimited – and the string type – which defines a length as its prefix and has no closing suffix.

Anyway, back to the problem at hand.  In addition to the CompositeType and PrimitiveType, you should also notice EntryType.  This “extra” type exists to handle the fact that bencode dictionaries are extremely weird and sit rather outside the “common functionality” umbrella of the format in general.  For one thing, the specification requires that dictionary entries be sorted by key, obviously implying some sort of Comparable relation.  Moreover, these keys must be themselves strings, but StringType isn’t comparable because its writeValue(OutputStream) method doesn’t return the data in question, but merely writes it to a given OutputStream.  Are we starting to see the problems with space-efficient implementations?
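A minimal sketch of the idea, under the assumption that keys are small enough to hold in memory even when values are not: the entry compares on its key alone, and a sorted set takes care of the ordering.  Entry and encodedValue are hypothetical names; the pre-encoded value string merely stands in for a lazily written value.

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class EntrySketch {
    /** Hypothetical entry: the key is kept in memory so it can be compared. */
    public static final class Entry implements Comparable<Entry> {
        final String key;
        final String encodedValue;   // stands in for a lazily written value

        public Entry(String key, String encodedValue) {
            this.key = key;
            this.encodedValue = encodedValue;
        }

        @Override
        public int compareTo(Entry other) {
            return key.compareTo(other.key);   // ordering depends on the key only
        }
    }

    public static String encode(SortedSet<Entry> entries) {
        StringBuilder out = new StringBuilder("d");
        for (Entry entry : entries) {
            // ASCII keys assumed, so character count equals byte count
            out.append(entry.key.length()).append(':').append(entry.key);
            out.append(entry.encodedValue);
        }
        return out.append('e').toString();
    }

    public static String demo() {
        SortedSet<Entry> entries = new TreeSet<Entry>();
        entries.add(new Entry("number", "i42e"));            // added out of order on purpose
        entries.add(new Entry("name", "11:Arthur Dent"));
        return encode(entries);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The trade-off is visible here: the sorted set needs every key up front, so only the values can remain lazy.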

Enough babble though, let’s see some code!  Here’s how we might encode some very simple data using the generator framework:

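import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.SortedSet;
// plus the generator types used below (DictionaryType, EntryType, IntegerType,
// ListType, ListTypeStream, StringType) from the framework described in this post
 
public class GeneratorTest {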
    public static void main(String[] args) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        final byte[] picture = new byte[0];        // presumably something interesting
 
        DictionaryType root = new DictionaryType() {
            @Override
            protected void populate(SortedSet<EntryType<?>> entries) {
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("name"), 
                        new LiteralStringType("Arthur Dent")));
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("number"), 
                        new IntegerType(42)));
 
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("picture"), 
                        new StringType() {
 
                    @Override
                    protected long getLength() {
                        return picture.length;
                    }
 
                    @Override
                    protected void writeValue(OutputStream os) throws IOException {
                        os.write(picture);
                    }
                }));
 
                entries.add(new EntryType<LiteralStringType>(
                        new LiteralStringType("planets"), 
                        new ListType() {
 
                    @Override
                    protected void populate(ListTypeStream list) throws IOException {
                        list.add(new LiteralStringType("Earth"));
                        list.add(new LiteralStringType("Somewhere else"));
                        list.add(new LiteralStringType("Old Earth"));
                    }
                }));
            }
        };
 
        try {
            root.write(os);
        } catch (IOException e) {
            e.printStackTrace();
        }
 
        System.out.println(new String(os.toByteArray()));
    }
 
    private static class LiteralStringType extends StringType 
            implements Comparable<LiteralStringType> {
        private final String value;
 
        public LiteralStringType(String value) {
            this.value = value;
        }
 
        @Override
        protected long getLength() {
            return value.length();
        }
 
        @Override
        protected void writeValue(OutputStream os) throws IOException {
            os.write(value.getBytes("US-ASCII"));
        }
 
        public int compareTo(LiteralStringType o) {
            return value.compareTo(o.value);        // ascending key order
        }
    }
}

It’s hard to imagine why some people claim that Java is a verbose language…

The API may seem a little clumsy, but most of that is caused by the conniptions required to make the generator lazily pull the data, rather than paging it all into memory ahead of time.  Throwing that aside, the rest of the verbosity seems to come from the need for LiteralStringType, rather than just having a StringType which could handle this for us.  The reason for this extra headache is shown in the population of the “picture” field, which presumably may contain several megabytes worth of data from some external source such as a file or database (in this case of course, it doesn’t contain anything, but that’s beside the point).

The result of the above is as follows:

d4:name11:Arthur Dent6:numberi42e7:picture0:7:planetsl5:Earth14:Somewhere else9:Old Earthee

Or, with a little formatting to make it more palatable:

d
  4:name
  11:Arthur Dent

  6:number
  i42e

  7:picture
  0:

  7:planets
  l
    5:Earth
    14:Somewhere else
    9:Old Earth
  e
e

Technically, this is no longer valid bencode, but it is much easier to read this way.

The Parser

With all this bustle surrounding the generator, it’s easy to forget about the inverse process: parsing.  As it turns out, this is both easier and far less elegant than the solution for the generator (I know, it’s a sad state of affairs when the above is considered “elegant”).  Here again, there was a need for the parser to be extremely efficient, especially in terms of memory.  Thus, the logical approach of simply parsing the stream into an in-memory tree doesn’t really work.  Instead, the parser must be a so-called “pull parser”, which only parses each token upon request.  The parser only does exactly what work you ask of it, nothing more.

My initial designs for the parser attempted to follow the example set by the generator: each value type self-contained, responsible for parsing its own format.  As it turns out, this can be difficult to accomplish.  I could have expanded slightly on the parser combinator concept, but monads are very clumsy to achieve in Java, which led me to rule out that option.  In the end, I took a middle ground.

[Image: class hierarchy of the parser values; full-size version at [5]]

As before, a common superinterface sits above the entire representative hierarchy.  To understand this hierarchy a little better, perhaps it would be helpful to look at the full source for Value:

public interface Value<T> {
    public T resolve() throws IOException;
    public boolean isResolved();
}

The resolve() method is really the core of the entire parser.  The concept is that each value will be able to consume the bytes necessary to determine its own value, which is converted and returned.  This is extremely convenient because it enables VariantValue(s) (such as string) to carry the logic for parsing to a specific length, rather than the conventional e terminator.  In order to avoid clogging up memory, the return value of resolve() should not be memoized [6] (though, there is nothing in the framework to prevent it).  Conventionally, values which are already resolved should throw an exception if they are resolved a second time.  This prevents the framework from holding onto values which are no longer needed.
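The resolve-once convention might look something like the following sketch.  The Value interface is copied from above; ShortString is a hypothetical implementation that reads a known number of bytes, and it materializes the whole string only because the demo payloads are tiny.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class OneShotValueSketch {
    public interface Value<T> {
        T resolve() throws IOException;
        boolean isResolved();
    }

    /** Hypothetical value that consumes exactly `length` bytes, exactly once. */
    public static final class ShortString implements Value<String> {
        private final InputStream is;
        private final int length;
        private boolean resolved = false;

        public ShortString(InputStream is, int length) {
            this.is = is;
            this.length = length;
        }

        @Override
        public String resolve() throws IOException {
            if (resolved) {
                throw new IOException("Value already resolved");
            }
            resolved = true;
            byte[] bytes = new byte[length];
            new DataInputStream(is).readFully(bytes);           // fails on a short stream
            return new String(bytes, StandardCharsets.UTF_8);   // returned, never stored
        }

        @Override
        public boolean isResolved() {
            return resolved;
        }
    }

    public static String demo() {
        InputStream is = new ByteArrayInputStream(
                "EarthMars".getBytes(StandardCharsets.US_ASCII));
        ShortString first = new ShortString(is, 5);
        ShortString second = new ShortString(is, 4);
        try {
            String a = first.resolve();
            String b = second.resolve();   // picks up where the first value stopped
            String again;
            try {
                first.resolve();
                again = "no error";
            } catch (IOException e) {
                again = e.getMessage();
            }
            return a + "," + b + "," + again;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the value keeps no reference to what it returned, the caller decides how long the data lives.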

You will also notice that CompositeValue not only inherits from Value, but also from the JDK interface, Iterable.  Logically, a composite value is a linear collection of values, consumed one at a time.  To me, that sounds a lot like a unidirectional iterator.  We can, of course, resolve the entire composite at once, mindlessly consuming all of its values, but since all of the values are lost once consumed, the only purpose for such an action would be if we know that we don’t care about a particular composite and we just want to rapidly skip to the next value in the stream.
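To illustrate the “unidirectional iterator” idea, here is a stripped-down sketch of a composite that hands out its children one at a time, straight off the stream.  It only understands lists of integers, and the names (ListCursorSketch, IntList) are invented for the example rather than taken from the framework.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ListCursorSketch {
    /** One-pass view over a list of integers; the opening 'l' is assumed consumed. */
    public static final class IntList implements Iterable<Long> {
        private final InputStream is;
        private boolean started = false;

        public IntList(InputStream is) {
            this.is = is;
        }

        @Override
        public Iterator<Long> iterator() {
            if (started) {
                throw new IllegalStateException("The stream can only be walked once");
            }
            started = true;
            return new Iterator<Long>() {
                private int lookahead = -2;   // -2: nothing buffered yet

                @Override
                public boolean hasNext() {
                    try {
                        if (lookahead == -2) {
                            lookahead = is.read();
                        }
                        if (lookahead < 0) {
                            throw new IOException("Unexpected end of stream in list");
                        }
                        return lookahead != 'e';   // 'e' closes the list
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }

                @Override
                public Long next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    if (lookahead != 'i') {
                        throw new IllegalStateException("Only integers are handled in this sketch");
                    }
                    lookahead = -2;
                    try {
                        return readInteger(is);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }

                @Override
                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    }

    /** Reads digits up to the closing 'e'; the 'i' prefix is already gone. */
    static long readInteger(InputStream is) throws IOException {
        boolean negative = false;
        long value = 0;
        int b;
        while ((b = is.read()) >= 0 && b != 'e') {
            if (b == '-') {
                negative = true;
            } else if (b >= '0' && b <= '9') {
                value = (value * 10) + (b - '0');
            } else {
                throw new IOException("Unexpected character in integer value: " + (char) b);
            }
        }
        if (b < 0) {
            throw new IOException("Unexpected end of stream in integer value");
        }
        return negative ? -value : value;
    }

    public static List<Long> demo(String text) {
        InputStream is = new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII));
        try {
            if (is.read() != 'l') {
                throw new IllegalArgumentException("Expected a list");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<Long> seen = new ArrayList<Long>();
        for (long n : new IntList(is)) {
            seen.add(n);   // collected here only so the demo has something to return
        }
        return seen;
    }
}
```

Calling demo("li1ei22ei-3ee") walks the three integers in order; a second call to iterator() on the same IntList fails, since the bytes are gone once consumed.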

Returning to primitive values, the resolve() method for IntegerValue is worthy of note, not so much for its uniqueness, but because it is very similar to the parsing technique used in all the other values:

public Long resolve() throws IOException {
    if (resolved) {
        throw new IOException("Value already resolved");
    }
    resolved = true;
 
    boolean negative = false;
    long value = 0;
 
    int b = 0;
    while ((b = is.read()) >= 0) {
        int digit = b - '0';
 
        if (digit < 0 || digit > 9) {
            if (b == '-') {
                negative = true;
            } else if (b == 'e') {
                break;
            } else {
                throw new IOException("Unexpected character in integer value: " 
                    + (char) b);
            }
        } else {
            value = (value * 10) + digit;
        }
    }
 
    if (negative) {
        value *= -1;
    }
 
    return value;
}

The i prefix itself is consumed before control flow even enters this method.  This is because the prefix is required to determine the appropriate value implementation to use.  Specifically, the logic to perform this determination is contained within the Parser class, which maintains a map of Value(s) and their associated prefixes.  String values have special logic associated with them, as they do not have a prefix.

As with most hand-coded parsers, this one operates on the principle of “eat until it hurts”.  We start out by assuming that the integer value extends to the end of the stream, then we set about to find a premature end to the integer, at which point we break out and call it a day.  Since we are moving from left to right through a base-10 integer, we must multiply the current accumulator by 10 prior to adding the new digit. 
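For anyone who wants to poke at the accumulator loop in isolation, here is a self-contained variation that reads from an in-memory stream.  It is a sketch rather than the framework's code: the class and method names are made up, and unlike the method above it treats a missing e terminator as an error instead of returning what it has.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class IntegerParseSketch {
    /** Reads digits up to the closing 'e'; the leading 'i' is assumed consumed already. */
    public static long readInteger(InputStream is) throws IOException {
        boolean negative = false;
        long value = 0;
        int b;
        while ((b = is.read()) >= 0) {
            if (b == 'e') {
                return negative ? -value : value;
            } else if (b == '-') {
                negative = true;
            } else if (b >= '0' && b <= '9') {
                value = (value * 10) + (b - '0');   // shift left one decimal place, then add
            } else {
                throw new IOException("Unexpected character in integer value: " + (char) b);
            }
        }
        throw new IOException("Unexpected end of stream in integer value");
    }

    /** Convenience wrapper for experiments: parses text such as "142e". */
    public static long parse(String text) {
        try {
            return readInteger(new ByteArrayInputStream(
                    text.getBytes(StandardCharsets.US_ASCII)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // For "142e" the accumulator goes 0, 1, 14, 142 before the 'e' ends the loop.
        System.out.println(parse("142e"));
        System.out.println(parse("-42e"));
    }
}
```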

Actually, the real heart of the parser framework is CompositeValue.  This class is inherited by Parser to define a special value encompassing the stream itself (which is viewed as a composite value with no delimiters and only a single child).  This unification allows us to keep the code for parsing a composite stream in a single location.  This implementation is a little less concise than the code for parsing an integer, but it follows the same pattern and is fairly instructive:

protected final Value<?> parse() throws IOException {
    if (resolved) {
        throw new IOException("Composite value already resolved");
    }
 
    if (previous != null) {
        if (!previous.isResolved()) {
            previous.resolve();        // ensure we're at the right spot in the stream
        }
    }
 
    byte b = -1;
    if (readAhead instanceof Some) {
        b = readAhead.value();
        readAhead = new None<Byte>();
    } else {
        b = read();
    }
 
    if (b >= 0) {
        Class<? extends Value<?>> valueType = parser.getValueType(b);
 
        if (valueType != null) {
            return previous = Parser.createValue(valueType, parser, is);
        } else if (b >= '0' && b <= '9') {        // '0' included: "0:" is the empty string
            return previous = readString(b - '0');
        } else if (b == ' ' || b == '\n' || b == '\r' || b == '\t') {
            return parse();        // loop state
        } else {
            throw new IOException("Unexpected character in the parse stream: " 
                + (char) b);
        }
    }
 
    throw new IOException("Unexpected end of stream in composite value");
}
 
private final StringValue readString(long length) throws IOException {
    int i = is.read();
 
    if (i >= 0) {
        byte b = (byte) i;
 
        if (b == ':') {
            return Parser.createValue(StringValue.class, parser, 
                new SubStream(is, length));
        } else if (b >= '0' && b <= '9') {
            return readString((length * 10) + b - '0');
        } else {
            throw new IOException("Unexpected character in string value: " 
                + (char) i);
        }
    }
 
    throw new IOException("Unexpected end of stream in string value");
}

It seems a bit imposing, but really this code is more of the same logic we saw previously when dealing with integers.  The only value type which really gives us trouble here is string.  We can’t simply treat it like the others because it has no prefix.  For this reason, we must assume that any unbound integer is an inclusive prefix for a string.  In most parser implementations, this would require backtracking, but because we are doing this by hand, we can condense the backtrack into an inherited parameter (borrowing terminology from attribute grammars [7]), avoiding the performance hit.
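The same trick can be written as a loop instead of recursion, which may make the “no backtracking” point easier to see: the digit the dispatcher already consumed is simply passed in as the starting value of the length.  This is a standalone sketch with invented names (StringPrefixSketch, readLength), and it reads the payload into a byte array only because the demo strings are tiny.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StringPrefixSketch {
    /** firstDigit is the byte the dispatcher already consumed; nothing is re-read. */
    public static long readLength(int firstDigit, InputStream is) throws IOException {
        long length = firstDigit - '0';
        int b;
        while ((b = is.read()) >= 0) {
            if (b == ':') {
                return length;                       // end of the length prefix
            }
            if (b < '0' || b > '9') {
                throw new IOException("Unexpected character in string value: " + (char) b);
            }
            length = (length * 10) + (b - '0');
        }
        throw new IOException("Unexpected end of stream in string value");
    }

    /** Parses the first string value in the text, e.g. "5:Earth" gives "Earth". */
    public static String demo(String text) {
        try {
            InputStream is = new ByteArrayInputStream(
                    text.getBytes(StandardCharsets.US_ASCII));
            int first = is.read();                   // what the dispatcher would have read
            if (first < '0' || first > '9') {
                throw new IOException("Not a string value");
            }
            long length = readLength(first, is);
            // Fine for a demo; a streaming parser would hand out a bounded sub-stream instead.
            byte[] bytes = new byte[(int) length];
            new DataInputStream(is).readFully(bytes);
            return new String(bytes, StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("14:Somewhere else9:Old Earth"));
    }
}
```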

There’s one final bit of weirdness which deserves attention before we bail on this small epic: dictionary values.  Intuitively, a dictionary value should be parsed into a Java Map, or some sort of associative data structure.  Unfortunately, a map is by definition a random access data structure.  Since we are dealing with a sequential bencode stream, the only recourse to satisfy this property would be to page the entire dictionary into memory.  This of course violates one of the primary requirements which is to avoid using more memory than necessary.

The solution I eventually chose to this problem was to limit dictionary access to sequential, which translates into alphabetical given the nature of bencode dictionaries.  Thus, a dictionary can be parsed in the same way as a list, where each element is a sequential key and value, jointly represented by EntryValue.  To make usage patterns slightly easier, EntryValue memoizes the key and value.  Due to the fact that both of these objects are themselves Value(s), this does not lead to inadvertent memory bloat.

Conclusion

Hopefully the parser and generator presented here will be of some utility in situations where you have to parse large volumes of bencoded data.  The API is (admittedly) bizarre and difficult to deal with, but the performance results are difficult to deny.  This framework is currently deployed in production, where benchmarks have shown that it imposes little-to-no runtime overhead, and practically zero memory overhead (despite the sizeable amounts of data being processed).

For convenience, I actually created a Google Code project [8] for this framework so as to facilitate its development internally to the project I was working on.  The end result is that, unlike most of my experiments, there is actually a proper SVN repository from which the full sources [10] may be obtained!  A packaged JAR [9] may be obtained from the downloads section.


Article printed from Code Commit: http://www.codecommit.com/blog

URL to article: http://www.codecommit.com/blog/java/bencode-stream-parsing-in-java

URLs in this post:

[1] bencode: http://en.wikipedia.org/wiki/Bencode

[2] Vuze: http://www.vuze.com

[3] space inefficient: http://en.wikipedia.org/wiki/Computational_complexity_theory#Time_and_space_complexity

[4] formal parser theory: http://www.codecommit.com/blog/scala/naive-text-parsing-in-scala

[5] Image: http://www.codecommit.com/blog/misc/parser-classes.png

[6] memoized: http://en.wikipedia.org/wiki/Memoization

[7] attribute grammars: http://en.wikipedia.org/wiki/Attribute_grammar

[8] Google Code project: http://code.google.com/p/jbencode/

[9] jbencode.jar: http://jbencode.googlecode.com/files/jbencode.jar

[10] Full sources: http://code.google.com/p/jbencode/source/checkout

All content copyright © 2010 Daniel Spiewak. Distributed under Creative Commons.
All code (unless otherwise stated) is licensed under the BSD License.