
Monads Are Not Metaphors


This article is also available in Japanese.

I am about to break a promise. Almost three years ago, I promised myself that I would never write an article about monads. There are too many such articles already; so many, in fact, that people are often confused by the sheer proliferation. Everyone seems to have a different take on the subject, meaning that those attempting to learn the concept for the first time are stuck trying to reason out the commonalities between burritos, space suits, elephants and desert Bedouins.

I’m not going to add to this menagerie of confusing analogies. The fact is that none of these parallels are entirely accurate. None of them convey the whole picture, and some of them are blatantly misleading in important respects. You will never come to understand monads by pondering Mexican food and the Final Frontier. The only way to understand monads is to see them for what they are: a mathematical construct.

Math (or not)

Here’s the thing about monads which is hard to grasp: monads are a pattern, not a specific type. Monads are a shape, they are an abstract interface (not in the Java sense) more than they are a concrete data structure. As a result, any example-driven tutorial is doomed to incompleteness and failure. The only way to really understand is to take a step back and look at what monads mean in the abstract rather than the concrete. Take a look at the following Ruby snippet:

def foo(bar)
  puts bar
  bar.size
end

Just as a quick Ruby refresher, we can rewrite this code in the following way:

def foo(bar)
  puts bar; bar.size
end

Ruby has this neat convention (which is shared by most modern languages) which causes the final expression in a method to be turned into the implicit return statement. Thus, the foo method will take a parameter, print it to standard out and then return its size. Fairly simple, right?

Here’s the puzzler: what is the semicolon (;) doing? It’s tempting to say that it’s just a separator, but theoretically speaking, there’s something much more interesting going on here. Let’s switch to Scala and add some Christmas trimmings:

def foo(bar: String) = {
  ({ () => println(bar) })()
  ({ () => bar.length })()
}

Just in case you’re unfamiliar with Scala, I’d like to make it clear that we are not required to enclose every statement inside its own lambda (anonymous function). I’m just doing that to make a point.

This function does exactly the same thing as the Ruby version. Well, the parameter is a bit more constrained, since we require a String rather than accepting anything that defines a size method, but moving past that… The major difference from what we had previously is that each statement is wrapped inside its own anonymous function, which we immediately call. We can again play the same semicolon trick that we used in Ruby. However, because the statements are actually functions, we can go a step further:

def foo(bar: String) = {
  ({ () => println(bar) } andThen { () => bar.length })()
}

(note: the andThen method isn’t defined for functions of 0-arity, but we’re going to pretend that it is and that it works the same as it does for functions of one argument. If it makes you feel better, you can pretend that these are both one-argument functions taking Unit as a parameter; the theoretical implications are the same, it just requires more syntax)

Notice that we haven’t actually used the semicolon (although we could have). Instead, we’re combining two functions together and invoking them at the very end. The semantics we’re using to do the combination are such that the first function will be evaluated, then its result (()) discarded and the second function evaluated with its result returned. For those following along at home, we could easily define andThen in the following way:

def funcSyntax[A](f1: () => A) = new {
  def andThen[B](f2: () => B) = { f1(); f2() }
}

In a way, we have defined a method which literally encapsulates the effect of the semicolon “operator”, allowing us to apply it directly to functions, rather than dealing with it indirectly at the level of statements. That’s kind of a cool thought, but the important point is that we are first executing the first function, discarding its result and then executing the second function, returning its result.
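Since Scala doesn’t actually define andThen for 0-arity functions, here is one way the trick could be made to really compile, sketched as an implicit class rather than the funcSyntax method above (the name FuncSyntax is purely illustrative):

```scala
// A sketch: Scala's standard library defines andThen only for functions of
// one argument, so we pimp it onto 0-arity functions ourselves.
implicit class FuncSyntax[A](f1: () => A) {
  def andThen[B](f2: () => B): () => B =
    () => { f1(); f2() }   // run f1, discard its result, return f2's result
}

val order = scala.collection.mutable.Buffer[String]()

val combined = { () => order += "first"; 42 } andThen { () => order += "second"; "done" }

assert(order.isEmpty)              // combining is not invoking: nothing ran yet
assert(combined() == "done")
assert(order == Seq("first", "second"))
```

Note that building combined runs nothing; the effects happen only when we invoke the combined function at the end, which is exactly the point.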

It should be clear that we could extend this to any number of functions. For example:

def foo(bar: String) = {
  ({ () => println("Executing foo") } andThen
   { () => println(bar) } andThen
   { () => bar.length })()
}

Still with me? Congratulations, you’ve seen your first monad.

You Could Have Invented Monads! (and maybe you already have)

This certainly isn’t a monad in the traditional sense, but if we worked at it, we could show that the monadic axioms do hold. The significant point here is what this monad is doing: combining one thing together with another in sequence. In fact, this is what all monads do, deep down. You start out with Thing One, and you have a function which (given Thing One) will give you Thing Two. Monads let you combine Thing One and your function together, producing a final resultant Thing. Let’s look at some more code:

case class Thing[+A](value: A)

This is about the simplest container imaginable (in fact, it is precisely the simplest container imaginable, but that’s not relevant now). We can wrap up values inside of Thing, but that’s about it:

val a = Thing(1)
val b = Thing(2)

Now, let’s switch into design mode for a moment. Imagine that we find ourselves writing a lot of code which looks like this:

def foo(i: Int) = Thing(i + 1)
val a = Thing(1)
val b = foo(a.value)        // => Thing(2)

We’re starting with a Thing, and then we’re using the value inside of that Thing to call a function which gives us a new Thing. If you think about it, this is actually a very common pattern. We have a value, and then we use that value to compute a new value. Mathematically, this is pretty much the same as the following:

def foo(i: Int) = i + 1
val a = 1
val b = foo(a)              // => 2

The only difference between these is that the first version wraps everything in Thing, while the second version is using “bare” values.

Now, let’s stretch our imagination just a bit and assume that we have a good reason for wrapping everything inside of Thing. There could of course be any number of reasons for this, but basically it boils down to the notion that Thing might have some extra logic which does interesting things with its value. Here’s the question: can we come up with a nicer way of going from a to b? Basically, we want to encapsulate this pattern as a more general tool.

What we want is a function which pulls the value out of Thing and then calls another function with that value, returning the result of that function call (which will be a new Thing). Since we’re good object-oriented programmers, we will define this as a method on class Thing:

case class Thing[+A](value: A) {
  def bind[B](f: A => Thing[B]) = f(value)
}

So if we have a Thing, we can pull its value out and use it to compute a new Thing, all in one convenient step:

def foo(i: Int) = Thing(i + 1)
val a = Thing(1)
val b = a bind foo          // => Thing(2)

Notice that this is a lot cleaner than our original version, while still performing exactly the same function. Thing is a monad.

The Monad Pattern

Any time you start with something which you pull apart and use to compute a new something of that same type, you have a monad. It’s really as simple as that. If it sounds like I’m describing almost all of your code, then good, that means you’re starting to catch on. Monads are everywhere. And by “everywhere”, I do mean everywhere.

To understand why this is, let’s look at what it is that makes Thing a monad:

val a = Thing(1)

The first thing is that I can wrap up a value inside of a new Thing. Object-oriented developers might call this a “constructor”. Monads call it “the unit function”. Haskell calls it “return” (maybe we shouldn’t try to figure out that one just yet). Whatever you call it though, it comes to the same thing. We have a function of type A => Thing[A]; a function which takes some value and wraps it up inside a new Thing.

a bind { i => Thing(i + 1) }

We also have this fancy bind function, which digs inside our Thing and allows a function which we supply to use that value to create a new Thing. Scala calls this function “flatMap”. Haskell calls it “>>=”. Again, the name doesn’t matter. What’s interesting here is the fact that bind is how you combine two things together in sequence. We start with one thing and use its value to compute a new thing.
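To make the sequencing concrete, here is the complete Thing with two binds chained together; each step uses the value computed by the previous one:

```scala
// The simplest possible monad: a plain container with bind
case class Thing[+A](value: A) {
  def bind[B](f: A => Thing[B]) = f(value)
}

// unit (the constructor) wraps a value; bind sequences computations on it
val result = Thing(1) bind { i => Thing(i + 1) } bind { i => Thing(i * 2) }
assert(result == Thing(4))
```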

It’s as simple as that! If you’re like me, then you’re likely asking the following question: if it’s so simple, what’s all the fuss about? Why not just call this the “using one thing to compute another” pattern? Well, for one thing, that’s far too verbose. For another, monads were first defined by mathematicians, and mathematicians love to name things. Mathematics is all about finding patterns, and it’s hard to find patterns if you can’t affix labels once you’ve found them.

More Examples

I said that monads are everywhere (they are!), but we’ve only looked at two examples so far. Let’s see a few more.


This might be the most famous monad of all, quite possibly because it’s one of the easiest to understand and by far the easiest to motivate. Consider the following code:

def firstName(id: Int): String = ...    // fetch from database
def lastName(id: Int): String = ...
def fullName(id: Int): String = {
  val fname = firstName(id)
  if (fname != null) {
    val lname = lastName(id)
    if (lname != null)
      fname + " " + lname
    else
      null
  } else {
    null
  }
}

Here again, we have a fairly common pattern. We have two functions (firstName and lastName) which are responsible for producing some data which may or may not be available. If the data is available, then it is returned. Otherwise, the result of these functions will be null. We then use these functions to do something interesting (in this case, compute a full name). Unfortunately, the fact that firstName and lastName may or may not produce a useful value needs to be handled explicitly with a set of nested ifs.

At first blush, it seems this is the best that we can do. However, if you look very closely, you can find the monad pattern buried in this code. It’s a little more complicated than last time, but it’s still there. Let’s try wrapping everything in Thing to make it clear:

def firstName(id: Int): Thing[String] = ...    // fetch from database
def lastName(id: Int): Thing[String] = ...
def fullName(id: Int): Thing[String] = {
  firstName(id) bind { fname =>
    if (fname != null) {
      lastName(id) bind { lname =>
        if (lname != null)
          Thing(fname + " " + lname)
        else
          Thing(null)
      }
    } else {
      Thing(null)
    }
  }
}

See it now? As I said, monads are everywhere. Here’s the really useful bit though: every time we bind, the very first thing we do inside the function is test the value to see if it is null. Why not move that logic into bind itself? Of course, we can’t really do that without changing Thing into something different, so we will define a new monad called Option:

sealed trait Option[+A] {
  def bind[B](f: A => Option[B]): Option[B]
}

case class Some[+A](value: A) extends Option[A] {
  def bind[B](f: A => Option[B]) = f(value)
}

case object None extends Option[Nothing] {
  def bind[B](f: Nothing => Option[B]) = None
}

If you block out everything except Some, this looks a lot like our old friend, Thing. The main difference is that Option has two different instantiations: Some, which contains a value, and None, which doesn’t contain a value. Think of None as being just an easier way of writing Thing(null).

What’s interesting is that Some and None need to have two different definitions of bind. The definition of bind in Some looks a lot like the definition in Thing, which makes sense as Some and Thing are almost identical. However, None defines bind to always return None, ignoring the specified function. How does this help us? Well, let’s return to our fullName example:

def firstName(id: Int): Option[String] = ...    // fetch from database
def lastName(id: Int): Option[String] = ...
def fullName(id: Int): Option[String] = {
  firstName(id) bind { fname =>
    lastName(id) bind { lname =>
      Some(fname + " " + lname)
    }
  }
}

All of those nasty if statements have disappeared. This works because firstName and lastName now return None when they fail to fetch the database record, rather than Thing(null). Of course, if we try to bind on None, the result will always be None. Thus, the fullName function returns the combination of firstName and lastName inside of a Some instance only if neither firstName nor lastName return None. If either one returns None, then the result of the whole thing will be None.
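Here is the same chain in a runnable form, with firstName and lastName stubbed in memory rather than hitting a real database (the ids and names are made up for illustration):

```scala
sealed trait Option[+A] {
  def bind[B](f: A => Option[B]): Option[B]
}
case class Some[+A](value: A) extends Option[A] {
  def bind[B](f: A => Option[B]) = f(value)
}
case object None extends Option[Nothing] {
  def bind[B](f: Nothing => Option[B]) = None
}

// stand-in "database": id 1 has both names, id 2 is missing a last name
def firstName(id: Int): Option[String] = if (id == 1 || id == 2) Some("Jane") else None
def lastName(id: Int): Option[String]  = if (id == 1) Some("Doe") else None

def fullName(id: Int): Option[String] = {
  firstName(id) bind { fname =>
    lastName(id) bind { lname =>
      Some(fname + " " + lname)
    }
  }
}

assert(fullName(1) == Some("Jane Doe"))
assert(fullName(2) == None)   // lastName was None, so the whole chain is None
assert(fullName(3) == None)   // firstName was None; lastName is never consulted
```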

For those keeping score at home, we have “accidentally” stumbled upon Groovy’s safe-dereference operator (?.), Raganwald’s andand for Ruby and many, many more. See? Monads are everywhere.

IO


Anyone trying to understand monads will inevitably run into Haskell’s IO monad, and the results are almost always the same: bewilderment, confusion, anger, and ultimately Perl. The fact is that IO is a rather odd monad. It is fundamentally in the same category as Thing and Option, but it solves a very different problem.

Here’s the deal: Haskell doesn’t allow side-effects. At all. Functions take parameters and return values; you can’t just “change” something outside the function. As an example of what this means, let’s return to our earlier Ruby code:

def foo(bar)
  puts bar
  bar.size
end

This function takes a value, calls its size method and returns the result. However, it also changes the standard output stream. This is pretty much the same as if we had a global array hanging around which we just mutated in-place:

def foo(bar)
  STDOUT += [bar]
  bar.size
end

Haskell doesn’t have variables of any sort (imagine if Scala didn’t have var, or if every variable in Java were final). Since we don’t have any variables, we can’t change anything in-place. Since we can’t change anything in-place, there’s no way we can define a puts function. At least, not the puts we’re used to.

Let’s switch back to Scala. Our goal here is to define a println function which doesn’t rely on any sort of mutable state (ignoring for the moment the fact that the standard output stream is always going to be mutable, since there’s no way we can “change” the user’s screen by cloning the physical display). One thing we could do is wrap up our standard output stream as a Vector which we carry along with our functions:

def foo(bar: String, stdout: Vector[String]) = {
  val stdout2 = println(bar, stdout)
  (bar.length, stdout2)
}

def println(str: String, stdout: Vector[String]) = stdout :+ str

Theoretically, we could write all of our println-enabled functions in this way, passing in the current stdout and receiving the new state as a result. At the end of the day, the entire program would produce a result in addition to the final state of stdout, which could be printed to the screen by the language runtime.
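Here is a runnable sketch of that style, threading the Vector[String] through two calls by hand (using :+ for Vector append):

```scala
// "println" appends to the threaded stdout instead of mutating the real one
def println(str: String, stdout: Vector[String]): Vector[String] = stdout :+ str

def foo(bar: String, stdout: Vector[String]): (Int, Vector[String]) = {
  val stdout2 = println(bar, stdout)
  (bar.length, stdout2)
}

// every call hands back the "new" stdout, which we thread into the next call
val (len1, out1) = foo("hello", Vector())
val (len2, out2) = foo("monads", out1)

assert((len1, len2) == (5, 6))
assert(out2 == Vector("hello", "monads"))
```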

This works (at least for println), but it’s ridiculously ugly. I would hate it if I had to write code like this, and early Haskell adopters felt exactly the same way. Unfortunately, things only get worse from there. Our trick with Vector[String] may work for the standard output stream, but what about standard in? At first blush, it seems as if the readLine function wouldn’t be as bad as println; after all, we aren’t changing anything! Unfortunately, something is clearly being changed at some level, otherwise repeated calls to readLine would yield the same result (clearly not the case).

Graphics updates, networking, the list goes on. It turns out that any useful program is going to need to have side-effects, otherwise there’s no way for all of that usefulness to get outside the program where we can see it. So, Haskell’s designers (more specifically, Philip Wadler) needed to come up with a way to solve this problem not only for standard out, but for all side-effects.

The solution is actually quite simple. To solve the standard out problem with println, we just passed a Vector[String] around, “modifying” it by returning the new state along with our regular return value. Here’s the inspiration: what if we did that for the entire universe? Instead of passing around just a plain Vector[String], we pass around Universe:

def foo(bar: String, everything: Universe) = {
  val everything2 = println(bar, everything)
  (bar.length, everything2)
}

def println(str: String, everything: Universe) = everything.println(str)

As long as the language runtime is able to somehow give us an instance of Universe which behaves the way we would expect, then this code will work just fine. Obviously, the runtime can’t really package up the entire cosmos and allow us to get new versions of it, but it can cheat a bit and pretend to give us the entire universe. The language runtime is allowed to implement the println function on the Universe object in whatever way it deems best (hopefully by actually appending to standard out). Thus, we let the runtime perform whatever magic is necessary while we remain blissfully ignorant of any and all side-effects.

This solves our problems alright, but it’s horrible in almost every other respect. We have this manual threading of the Universe, which is both verbose and painful. Even worse, this sort of thing is very error prone (i.e. what happens if we “modify” the universe and then go back and change an older version of it? do we get two, parallel universes?). The heart of our problem now is that we are using the old version of the universe to compute a new version of the universe, and we’re doing that manually. We’re taking something (in this case, the universe) and using its value to compute a new something (the new version of the universe). Sound familiar?

Philip Wadler’s inspiration was to take advantage of the monad pattern to solve this problem. The result is the IO monad:

def foo(bar: String): IO[Int] = {
  println(bar) bind { _ => IO(bar.length) }
}

def println(str: String): IO[Unit] = {
  // TODO insert magic incantations
}

Of course, we can’t actually implement println, since this fake language isn’t allowed to involve itself in side-effects. However, the language runtime could provide a native (read: cheat code) implementation of println which performs its side-effects and then conjures a new version of IO (i.e. the modified universe).

The only catch with this design is that we can never get anything out of IO. Once you start down the dark path, forever will it dominate your destiny. The reason for this is one of language purity. Haskell wants to disallow side-effects, but if it were to allow us to pull values out of IO, then we could very easily bypass its safe-guards:

def readLine(): IO[String] = {
  // TODO insert magic incantations
}

def fakeReadLine(str: String): String = {
  val back: IO[String] = readLine()
  back.get      // whew!  doesn't work
}

As you can see, if we could pull values out of IO, then the whole exercise would become a waste of time, since it would be trivially easy to hide side-effects within wrapper functions.
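For illustration only, here is one way a toy IO could be sketched in Scala: the constructor captures a deferred computation, bind composes deferred computations, and nothing runs until someone invokes unsafeRun (a name invented here; this is not Haskell's actual implementation):

```scala
// A toy IO: wraps a deferred computation; `bind` builds a bigger deferred
// computation without running anything.
final class IO[A](val unsafeRun: () => A) {
  def bind[B](f: A => IO[B]): IO[B] =
    new IO(() => f(unsafeRun()).unsafeRun())
}
object IO {
  def apply[A](a: => A): IO[A] = new IO(() => a)
}

// stand-in for real console output so we can observe when effects happen
val log = scala.collection.mutable.Buffer[String]()
def printLine(s: String): IO[Unit] = IO { log += s; () }

val program: IO[Int] =
  printLine("Executing foo") bind { _ =>
    printLine("bar") bind { _ =>
      IO("bar".length)
    }
  }

assert(log.isEmpty)                 // building the program ran no effects
assert(program.unsafeRun() == 3)    // the "runtime" runs them at the end
assert(log == Seq("Executing foo", "bar"))
```

Note that this toy version cheats: Scala lets us call unsafeRun whenever we like, which is exactly the escape hatch Haskell denies us.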

Of course, none of this is particularly relevant to Scala or Ruby, much less Java! Scala doesn’t restrict side-effects. It allows you to call println whenever you feel like it, and it provides a way of declaring mutable variables (var). In a sense, Scala is hiding a hypothetical IO monad. It’s as if all Scala code is implicitly inside of IO, and thus implicitly threading the state of the universe from one moment to the next. In light of this fact, why should we care at all about how IO works? Mostly because it’s a monad which is very different from Thing and Option. I briefly considered using State, but that’s too much ancillary complexity when we’re trying to focus on the core idea of using the value from one thing to compute another thing.

Now What?

So, we’ve identified the monad pattern. We can spot it in code and employ it in an impressive range of situations, but why are we bothering with all the ceremony? If we’re already using monads all over the place without realizing it (e.g. semicolon), then why do we have to bother futzing with all of this nomenclature? Simply put: why do we care that Option is a monad so long as it just “does the right thing”?

Well, the first answer to that lies in the nature of mathematics, and by extension, computer programming. As I said before, mathematics is all about identifying patterns (well, you also play with those patterns to generate larger, more intricate patterns, but that’s really just a means to an end). Once you identify a pattern, you name it. That’s just what mathematicians do. Imagine if Newton hadn’t named the “derivative”, and he simply insisted on calling it “an expression for the line tangent to a given curve relative to a particular value of x, where x is some variable in the expression for the given curve.” For one thing, calculus textbooks everywhere would be about 50 times longer. For another, we probably never would have been able to see “the derivative” as an abstract entity. Partial derivation would have never been devised. Integrals, differential equations, infinite series, and almost all of physics might never have happened. None of these consequences have anything to do with the name; that’s just a label. Rather, it was the fact that Newton was able to see (and represent) the derivative in the abstract, as a mathematical shape to be manipulated and applied in novel ways.

If you understand the Option monad, then you can use the Option monad. You can see places in your code where it can be applied, and you will reap enormous benefits as a result. However, if you understand monads as an abstraction, then you will not only understand Option, but also IO, State, Parser, STM, the list goes on. Or at least, you will understand the fundamental properties of these constructs (the rest is details). You will begin to see places where you are doing monadic things even when those things don’t fit exactly into the restricted mold of Option or State. This is where the true utility can be found.

Besides the (vast) improvements in your thought process, there’s also a more immediate practical upshot. It’s possible to define functions which work on monads in the generic sense, rather than specializing in one or the other. Just as Swing programming would be impossible if you had to rewrite every function for each specific instance of Component, there are many things which would be impossible (or at least, very very impractical) if you had to rewrite them for each specific monad. One such function is sequence:

trait Monad[M[_]] {
  def unit[A](a: A): M[A]
  def bind[A, B](m: M[A])(f: A => M[B]): M[B]
}

implicit object ThingMonad extends Monad[Thing] {
  def unit[A](a: A) = Thing(a)
  def bind[A, B](thing: Thing[A])(f: A => Thing[B]) = thing bind f
}

implicit object OptionMonad extends Monad[Option] {
  def unit[A](a: A) = Some(a)
  def bind[A, B](opt: Option[A])(f: A => Option[B]) = opt bind f
}

def sequence[M[_], A](ms: List[M[A]])(implicit tc: Monad[M]) = {
  ms.foldRight(tc.unit(List[A]())) { (m, acc) =>
    tc.bind(m) { a => tc.bind(acc) { tail => tc.unit(a :: tail) } }
  }
}

There are a lot of ways to pretty this up, but I wanted to be as explicit as possible for the sake of illustration. The general function of sequence is to take a List of monad instances and return a monad instance of a List of those elements. For example:

val nums = List(Some(1), Some(2), Some(3))
val nums2 = sequence(nums)           // Some(List(1, 2, 3))

The magic is that this function works on any monad:

val nums = List(Thing(1), Thing(2), Thing(3))
val nums2 = sequence(nums)           // Thing(List(1, 2, 3))

In this case, Monad (the trait) is an example of a typeclass. Basically, we’re saying that there is this general idea of a monad, and any monad will define two functions: unit and bind. In this way, we can define functions which operate on monads without knowing which specific monad we’re manipulating. Think of it like the adapter pattern on steroids (sprinkled with a goodly portion of Scala’s implicit magic).

This is reason number two for understanding the general concept of a monad, rather than just the specific: you suddenly become able to define tons of nifty utility functions. For more examples of this, you need look no further than Haskell’s standard library.

Those Pesky Axioms

Congratulations! You just made it through an entire monad tutorial without ever having to worry about the monadic axioms or what they mean. Now that you’ve got the concept down (hopefully), you can graduate to the axioms themselves.

As it turns out, the monadic axioms are really quite intuitive. We’ve actually been assuming them all along without ever really saying so. The axioms define how the unit (constructor) and bind (composition) functions are supposed to behave under certain situations. Think of them a bit like the laws governing integer addition (commutativity, associativity, etc). They don’t tell you everything about monads (actually, they don’t tell you much at all from an intuitive standpoint), but they do give you the basic, mathematical underpinnings.

Excited yet? No, I didn’t think so. Well, here they are anyway, defined in terms of the Monad typeclass we used earlier:

def axioms[M[_]](implicit tc: Monad[M]) {
  // identity 1
  def identity1[A, B](a: A, f: A => M[B]) {
    val ma: M[A] = tc.unit(a)
    assert(tc.bind(ma)(f) == f(a))
  }
  forAll { (a, f) => identity1(a, f) }        // sort-of ScalaCheck

  // identity 2
  def identity2[A](ma: M[A]) {
    assert(tc.bind(ma)(tc.unit) == ma)
  }
  forAll { m => identity2(m) }

  // associativity
  def associativity[A, B, C](m: M[A], f: A => M[B], g: B => M[C]) {
    val mf: M[B] = tc.bind(m)(f)
    val mg: M[C] = tc.bind(mf)(g)

    val mg2: M[C] = tc.bind(m) { a =>
      tc.bind(f(a))(g)
    }
    assert(mg == mg2)
  }
  forAll { (m, f, g) => associativity(m, f, g) }
}

The first two axioms (the “identity” ones) are basically saying that the unit function is a simple constructor with respect to the bind function. Thus, when bind “pulls apart” the monad and passes the value to its function parameter, that value will be precisely the value that unit puts into the monad. Likewise, if the function parameter given to bind simply takes the value and wraps it back up inside the monad, then the final result is exactly the same as if we had left the monad alone.

The third axiom is the most complicated to express, but I think it’s actually the most intuitive. Basically, this axiom is saying that if you first bind with one function, then you bind the result against another function, that’s the same as applying the first function to the value inside the monad and then calling bind on the result of that application. This isn’t exactly associativity in the classical sense, but you can sort of think about it in that way.
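The axioms are easy to spot-check concretely for Thing; a few fixed values stand in for the forAll quantification:

```scala
case class Thing[+A](value: A) {
  def bind[B](f: A => Thing[B]) = f(value)
}

def unit[A](a: A) = Thing(a)

val f = (i: Int) => Thing(i + 1)
val g = (i: Int) => Thing(i * 2)
val m = Thing(3)

// identity 1: wrapping a value and then binding is just function application
assert((unit(3) bind f) == f(3))

// identity 2: binding with the constructor changes nothing
assert((m bind { i => unit(i) }) == m)

// associativity: nesting the binds either way gives the same result
assert(((m bind f) bind g) == (m bind { i => f(i) bind g }))
```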

One of the useful consequences of the third law (associativity) comes up quite frequently when you have one bind nested inside another. Whenever you find yourself in that situation, you can move the nested bind outside of the outer bind, flattening your code slightly. Like so:

val opt: Option[String] = Some("string")

opt bind { str =>
  val innerOpt = Some("Head: " + str)
  innerOpt bind { str => Some(str + " :Tail") }
}

// is the same as...

opt bind { str => Some("Head: " + str) } bind { str => Some(str + " :Tail") }

The rewritten code has a much more “sequential” feel (the essence of monads!) and is almost always shorter than the nested form.

As I said before, the axioms are very, very intuitive once you understand the abstract concept of a monad. They may not be intuitive to express, but their consequences are very easy to understand and quite natural in practice. So, don’t spend a lot of time trying to memorize the axioms. Your time will be better spent contemplating the way that semicolon works.

Conclusion


Monads are not scary. They are not complex, academic or esoteric. Monads are an abstract, mathematical label affixed to a pattern found in almost all code. We all use monads every day. The hardest part in understanding monads is recognizing that the hardest part isn’t so hard after all.

I sincerely hope that this latest, dubious venture down the well-trod path of monad exposition has proven a fruitful one. I can say without hesitation that the understanding and perspective which comes from a solid grasp of monads is invaluable in very practical, down-to-earth coding (even in staid languages like Java!). Monads impart an understanding of the very fabric of sequential computation and composability, and if that isn’t sufficient motivation to learn, I don’t know what is.

Unveiling the Mysteries of GLL Part 2: The Problem Space


In the previous article, we skimmed the surface of automated text parsing and set the stage for our impending exploration of the GLL algorithm itself. However, before we can move ahead and do just that, we should first build up some idea of what the requirements are for truly generalized parsing and what sort of problems we are likely to encounter.

I’m going to assume you already have a working understanding of context-free grammars and how to read them. If you don’t, then I refer you to the Wikipedia page on CFGs. Specifically, the examples are quite instructive.

Recursion


S ::= '(' S ')'
    | '(' ')'

In this grammar, the S non-terminal is recursive because one of its productions refers back to itself. Specifically, the first rule corresponding to the S non-terminal is of the form α S β, where α and β stand for some arbitrary rule fragments (in this case, '(' and ')', respectively).

When a non-terminal maps to a production which is recursive in its first token, we say that rule is left-recursive. For example:

E ::= E '+' N
    | E '-' N
    | N

N ::= '1' | '2' | '3' | ...

In this grammar, the E non-terminal is left-recursive in two of its three productions. Left-recursion is a particularly significant property of a grammar because it means that any left-to-right parse process would need to parse E by first parsing E itself, and then parsing '+' and finally N (assuming that the parser is using the first production). As you can imagine, it would be very easy for a naïve parsing algorithm to get into an infinite loop, trying to parse E by first parsing E, which requires parsing E, which requires parsing E, etc.
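To see the loop concretely, here is a deliberately naïve recursive-descent sketch of the E rule (hypothetical code; a depth counter stands in for the stack overflow you would otherwise get):

```scala
// E ::= E '+' N  -- a naive top-down parser tries the E alternative first
var depth = 0

def parseE(input: String, pos: Int): Int = {
  depth += 1
  if (depth > 1000)
    sys.error("left recursion: parseE re-entered without consuming any input")
  // the first symbol of the production is E itself, at the same position...
  val afterE = parseE(input, pos)
  // ...so we never get here to consume '+' and parse N
  afterE
}

val blewUp =
  try { parseE("1+2", 0); false }
  catch { case _: RuntimeException => true }

assert(blewUp)
assert(depth > 1000)   // the parser made no progress on the input at all
```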

Mathematically, left-recursive productions of a non-terminal E are always of the form α E β, where α ⇒* ε. In plain English, this means that a production is left-recursive if the part of the production preceding the recursive token can derive the empty string (ε). This is a very nice way of defining left-recursion, because it allows for a specific type of left-recursion known as hidden left-recursion. For example:

A ::= B A '.'
    | '.'

B ::= ','
    |

Notice how the second production for B is empty? This means that B can map to ε, and thus A exhibits hidden left-recursion. The difference between hidden and direct left-recursion is that hidden left-recursion is obscured by other rules in the grammar. If we didn’t know that B had the potential to produce the empty string, then we would never have realized that A is left-recursive.
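One way to surface hidden left-recursion mechanically is to first compute the nullable non-terminals with a standard fixed-point pass, then flag any production where a recursive occurrence is preceded only by nullable symbols. A sketch over the grammar above (the representation is invented for illustration):

```scala
// Productions as symbol lists; keys of the map are the non-terminals.
type Grammar = Map[String, List[List[String]]]

// A ::= B A '.' | '.'      B ::= ',' | (empty)
val g: Grammar = Map(
  "A" -> List(List("B", "A", "."), List(".")),
  "B" -> List(List(","), List())
)

// fixed point: a non-terminal is nullable if some production is all-nullable
def nullables(g: Grammar): Set[String] = {
  var known = Set.empty[String]
  var changed = true
  while (changed) {
    changed = false
    for ((nt, prods) <- g if !known(nt) && prods.exists(_.forall(known))) {
      known += nt
      changed = true
    }
  }
  known
}

// left-recursive if some production contains nt preceded only by nullables
def leftRecursive(nt: String, g: Grammar, nulls: Set[String]): Boolean =
  g(nt).exists { prod =>
    val i = prod.indexOf(nt)
    i >= 0 && prod.take(i).forall(nulls)
  }

val nulls = nullables(g)
assert(nulls == Set("B"))              // B's empty production makes it nullable
assert(leftRecursive("A", g, nulls))   // hidden: the "B" before "A" can vanish
assert(!leftRecursive("B", g, nulls))
```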

LR parsing algorithms (such as tabular LALR or recursive-ascent) can handle direct left-recursion without a problem. However, not even Tomita’s GLR can handle hidden left-recursion (which technically means that the GLR algorithm isn’t fully general). Hidden left-recursion is a perfectly valid property for a context-free grammar to exhibit, and so in order to be fully general, a parsing algorithm must be able to handle it. As it turns out, this is just a little bit troublesome, and many papers on parsing algorithms spend a large majority of their time trying to explain how they handle hidden left-recursion.

It’s worth noting that left-recursion cannot be handled by top-down algorithms (such as tabular LL(k) or recursive-descent) without fairly significant contortions. However, such algorithms have no trouble at all with other forms of recursion (such as our original recursive example with S). Left-recursion arises very naturally in many grammars (particularly involving binary forms such as object-oriented method dispatch or mathematical operators) and is one of the primary reasons why many people prefer algorithms in the LR family over LL algorithms.


It is perhaps surprising that context-free grammars are not required to be unambiguous. This means that a grammar is allowed to accept a particular input by using more than one possible sequence of rules. The classic example of this is arithmetic associativity:

E ::= E '+' E
    | E '-' E
    | '1' | '2' | '3' | '4' | ...

This is an extremely natural way to encode the grammar for mathematical plus and minus. After all, when we mentally think about the + operator, we imagine the structure as two expressions separated by +, where an expression may be a primitive number, or a complex expression like another addition or a subtraction operation. Unfortunately, this particular encoding has a rather problematic ambiguity. Consider the following expression:

4 + 5 + 2

Clearly this is a valid expression, and a parser for the example grammar will certainly accept it as input. However, if we try to generate a parse tree for this expression, we’re going to run into two possible outcomes:

(Two parse trees are possible: the left tree groups the expression as (4 + 5) + 2, while the right tree groups it as 4 + (5 + 2).)

Literally, the question is whether we expand the left or the right E first in the top-level + expression. Expanding the left E gives us the tree on the left, where the first two operands (4 and 5) are added together, with that result then added to 2. Expanding the right E gives us the tree on the right, where we add 5 and 2 together, adding that to 4.

Of course, in the case of addition, associativity doesn’t matter too much; we get the same answer either way. However, if this were division, then associativity could make all the difference in the world: (4 / 5) / 2 = 0.4, but 4 / (5 / 2) = 1.6. The point is that we can follow all of the rules set forth by the grammar and arrive at two very different answers. This is the essence of ambiguity, and it poses endless problems for most parsing algorithms.

If you think about it, in order to correctly handle ambiguity, a parser would need to return not just one parse tree for a particular input, but all possible parse trees. The parser’s execution would have to keep track of all of these possibilities at the same time, somehow following each one to its conclusion, maintaining its state to the very end. This is not an easy problem, particularly in the face of grammars like the following:

S ::= S S S
    | S S
    | 'a'

Clearly, this is a contrived grammar. However, it’s still a valid CFG that a generalized parsing algorithm would need to be able to handle. The problem is that there are an exponential number of possible parse trees for any given (valid) input. If the parser were to naïvely follow each and every one of these possibilities one at a time, even on a short string, the parse process would take more time than is left in the age of the universe as we know it. Obviously, that’s not an option.
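Just to drive the point home, here is a small dynamic program of my own (an illustration, not part of any parsing algorithm) which counts the distinct parse trees the grammar above admits for a string of n 'a' characters, by summing over every way of splitting the input between the S S and S S S productions:

```scala
object AmbiguityCount {
  // trees(k) = number of distinct parse trees for a string of k 'a's
  def trees(n: Int): BigInt = {
    val t = Array.fill(n + 1)(BigInt(0))
    t(1) = 1                                         // S ::= 'a'
    for (len <- 2 to n) {
      for (i <- 1 until len)                         // S ::= S S
        t(len) += t(i) * t(len - i)
      for (i <- 1 until len; j <- 1 until len - i)   // S ::= S S S
        t(len) += t(i) * t(j) * t(len - i - j)
    }
    t(n)
  }
}
```

Already trees(4) yields 10, and the count explodes from there, which is exactly why naïvely enumerating every tree one at a time is a non-starter.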


As you can see, generalized parsing has some very thorny problems to solve. It’s really no wonder that the algorithms tend to be cryptic and difficult to understand. However, this is not to say that the problems are insurmountable. There are some very elegant and easy-to-understand algorithms for solving these problems, and GLL is one of them.

In the next article, we will start looking at the GLL algorithm itself, along with that chronically under-documented data structure at its core, the graph-structured stack (GSS).

Unveiling the Mysteries of GLL Part 1: Welcome to the Field


Generalized parsing is probably the most misunderstood topic in the entire field of automated language processing. There is a persistent perception that generalized parsing is slow and impractical. Even worse, most people seem to believe that generalized parsing is complicated and unpredictable (a perception deriving from the extremely obfuscated nature of most generalized parsing algorithms). This is all very unfortunate, because none of it is true anymore.

Now, before I move forward and justify that rather bold statement, I should probably define what I mean when I say “generalized parsing”. Parsing algorithms generally fall into one of a number of categories. The most common ones are:

  • LL(k) (e.g. ANTLR, JavaCC)
  • Non-Commutative LL(*) (e.g. most hand-written parsers, parser combinators)
  • Memoized Non-Commutative LL(*) (e.g. Packrat parsers, Scala 2.8 parser combinators)
  • LR(k) (e.g. YACC, Bison)
  • Generalized

These are arranged roughly in order from least to most powerful. This means that the supported expressivity of grammars increases as you go down the list. Techniques like LL don’t support left-recursion or ambiguity; LR supports left-recursion, but not ambiguity; and generalized supports anything that’s context-free. Note that this is a very rough arrangement. It’s difficult to formally analyze the non-commutative LL(*) techniques, and so theorists tend to be a little unclear as to exactly how powerful the techniques are with respect to better-defined classes like LL and LR. However, it is generally assumed that non-commutative LL(*) is strictly more powerful than LL(k) but likely less powerful than LR(k) (since left-recursion can be handled with memoization, but some LR local ambiguities do not always resolve correctly).

As intuition would suggest, algorithms are generally more complex (both in terms of comprehension and asymptotic performance) the more powerful you get. LL(k) algorithms, both the table-driven and the directly-encoded, are usually quite easy to understand. Parser states correspond directly to grammatical rules, and so it’s usually pretty easy to tease out the structure of the parser. By contrast, LR(k) algorithms (most commonly, tabular LALR and recursive-ascent) are usually very difficult to conceptualize and next to impossible to read when encoded in a programming language. One look at the recursive-ascent example on Wikipedia is sufficient to confirm this property.

Most of the listed non-generalized parsing techniques are O(n) in the length of the input. The one exception is non-commutative LL(*), which is O(k^n) in the case where the grammar is recursively ambiguous and the input is invalid. Generalized parsing, on the other hand, has a lower bound of Ω(n^2) (a property which falls out of its close relationship with matrix multiplication). Sub-cubic generalized parsing is possible in theory — Valiant showed how to reduce the problem to matrix multiplication — but no practically usable algorithm has done better than O(n^3). Clearly, generalized parsing does impose a performance penalty beyond more “conventional” techniques.

For these reasons, generalized parsing is usually confined to applications which actually need to be able to handle the full set of context-free grammars — most notably, genome analysis and natural language processing. Even worse, the wariness surrounding generalized parsing has led to its avoidance in several less esoteric situations which would benefit greatly from its power — most notably, the Scala, Haskell and C/C++ compilers should use generalized parsing, but don’t.

This is really a shame, because generalized parsing benefits from two major areas of advancement in the past several decades: CPU clock speed and algorithmic improvements. The bias against generalized parsing dates back to a time when processors were slow enough that the difference between O(n) and O(n^3) was fairly significant, even on shorter input strings. It also predates several newer algorithms which encode the full power of generalized parsing in clean, elegant and understandable ways.

It’s a prejudice, plain and simple, and I plan to do something about it. In this series of articles, I will explain some of the fundamental techniques used to make generalized parsing fast and efficient, particularly as they relate to a newer algorithm known as “generalized LL” (GLL). I’ll give a brief outline of how this algorithm is implemented and how it can be easily used in Scala via the gll-combinators framework. Additionally, I will provide some motivation for why generalized parsing is so important, particularly in light of the modern trend in language design toward more and more powerful syntax. Even if you don’t agree with my conclusions, I hope you will come away from this series with a more complete understanding of the state of generalized parsing and how it can be effectively applied.

Working with Scala’s XML Support


XML is probably one of Scala’s most controversial language features (right behind unrestricted operator overloading). On the one hand, it’s very nice to be able to simply embed XML fragments and XPath-like expressions within your Scala source code. At least, it’s certainly a lot nicer than the string-literal approach that is required in many other languages. However, XML literals also complicate the syntax tremendously and pose endless difficulties for incremental syntax-aware editors such as IDEs.

Irrespective of the controversy though, XML literals are part of the language and they are here to stay. Martin Odersky has mentioned on multiple occasions that he half-regrets the inclusion of XML literal support, but he can’t really do anything about it now that the language has taken hold and the syntax has solidified. So, we may as well make the best of it…

Unfortunately, Scala’s XML library is very…weird. Especially in Scala 2.7. The class hierarchy is unintuitive, and there are odd pitfalls and correctness dangers just waiting to entrap the unwary. That fact, coupled with the lack of appropriate documentation in the language specification, leads to a very steep learning curve for new users. This is quite unfortunate, because a solid understanding of Scala’s XML support is vital to many applications of the language, most notably the Lift web framework.

I can’t personally do anything about the strangeness in the XML library. Like the literal syntax itself, it’s too late to make many fundamental changes to the way XML works in Scala. However, I can try to make it easier for beginners to get up and running with Scala’s XML support.

The Hierarchy

Before we get to literals and queries, it’s important to have some idea of the shape of Scala’s XML library and how its class hierarchy works. I found (and find) this to be the most unintuitive part of the entire ordeal.


There are actually more classes than just this (such as Document, which extends NodeSeq, and Unparsed, which extends Atom), but you get the general idea. The ones I have shown are the classes which you are most likely to use on a regular basis.

Starting from the top, NodeSeq is probably the most significant class in the entire API. The most commonly used methods in the library are defined in the NodeSeq class, and most third-party methods which work with XML usually work at the level of NodeSeq. More specifically, NodeSeq defines the \\ and \ methods, which are used for XPath selection, as well as the text method, which is used to recursively extract all text within a particular set of nodes. If you’re familiar with libraries like Nokogiri, you should be right at home with the functionality of these methods.

One particularly useful aspect of Scala’s XML library is the fact that NodeSeq extends Seq[Node]. This means that you can use standard Scala collections operations to fiddle with XML (map, flatMap, etc). Unfortunately, more often than not, these methods will return something of type Seq[_], rather than choosing the more specific NodeSeq when possible. This is something which could have been solved in Scala 2.8, but has not been as of the latest nightly. Until this design flaw is rectified, the only recourse is to use the NodeSeq.fromSeq utility method to explicitly convert anything of type Seq[Node] back into the more specific NodeSeq as necessary:

val nodes: Seq[Node] = ...
val ns: NodeSeq = NodeSeq fromSeq nodes

Immediately deriving from NodeSeq is another landmark class in the Scala API, Node. At first glance, this may seem just a bit weird. After all, Node inherits from NodeSeq which in turn inherits from Seq[Node]. Thus, a single Node can also be viewed as a NodeSeq of length one, containing exactly itself. Yeah, that one took me a while…

Everything in the Scala XML library is a NodeSeq, and almost everything is a Node. If you remember this fact, then you understand the entire API. The Elem class represents a single XML element with associated attributes and a child NodeSeq (which may of course be empty). The Group class is a bit of a hack and should never be used directly (use NodeSeq.fromSeq instead).

Of the SpecialNode hierarchy, only Atom deserves any special attention, and of its children, Text is really the most significant. Text is simply the way in which the Scala XML library represents text fragments within XML. Clearly, XML elements can have textual content, but since the child(ren) of an Elem have to be Node(s), we need some way of wrapping up a text String as a Node. This is where Text comes in.
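We can observe this wrapping directly (a REPL-style example of my own): the string content of an element is stored as a Text child node.

```scala
import scala.xml.Text

val e = <greeting>Hello</greeting>

e.child.head.isInstanceOf[Text]    // => true
e.child.head.text                  // => "Hello"
```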

It is worth noting that the Atom class actually takes a single type parameter. Text inherits from Atom[String]. I find this aspect of the API just a bit odd, since there aren’t any subclasses of Atom which inherit from anything other than Atom[String], but that’s just the way it is.


Now that we’ve got the fundamental class hierarchy out of the way, it’s time to look at the most visible aspect of Scala’s XML support: XML literals. Most Scala web frameworks tend to make heavy use of XML literals, which can be a bit annoying due to the difficulties they cause most editors (I’m still trying to get the jEdit support nailed down). Even still, XML literals are a very useful part of the language and almost essential if you’re going to be working with XML content.

Fortunately, Scala’s XML syntax is as intuitive to write as it is difficult to parse:

val ns = <span id="foo"><strong>Hello,</strong> World!</span>
println(ns.toString)      // prints the raw XML

The thing to remember is that any time text appears after the < operator without any whitespace, Scala’s parser will jump into “XML mode”. Thus, the following code is invalid, even though it seems like it should work:

val foo = new {
  def <(a: Any) = this
}

foo <foo          // error!

<rant>This is yet another example of Scala’s compiler behaving in strange and unintuitive ways due to arbitrary resolution of ambiguity in the parser. The correct way to handle this would be for the parser to accept the local ambiguity (XML literal vs operator and value reference) and defer the resolution until a later point. In this case, the final parse tree would be unambiguous (there is no way this could correctly parse as an XML fragment), so there’s no danger of complicating later phases like the type checker. Unfortunately, Scala’s parser (as it stands) is not powerful enough to handle this sort of functionality. *sigh*</rant>

Scala’s XML literal syntax is actually sugar for a series of Elem and Text instantiations. Specifically, Scala will parse our earlier example as the following:

val ns = Elem(null, "span", new UnprefixedAttribute("id", Text("foo"), Null), TopScope, 
  Elem(null, "strong", Null, TopScope, Text("Hello,")), Text(" World!"))

You will notice that the attribute value is actually wrapped in a Text node. This is necessary because attributes can be returned from XPath selectors, which always return values of type NodeSeq. Thus, the content of an attribute must be of type Node. Unfortunately, this opens up a rather obvious hole in the type safety of the API: the compiler will allow you to store any Node within an attribute, including something of type Elem. In fact, you won’t even get an exception at runtime! The following code compiles and runs just fine:

new UnprefixedAttribute("id", <foo/>, Null)

The good news is that you will almost never use UnprefixedAttribute directly, mostly because the API is so clumsy. Most of the time, you will spend your time either consuming pre-baked XML coming in from some external source, or synthesizing it yourself using literals.

Of course, not all XML is fully-known at compile time. In fact, most often XML is just a structured wrapper around some data which is produced dynamically. To that end, Scala provides a convenient syntax for XML interpolation. This makes it possible to construct XML dynamically based on variables and expressions. For example, we might want to make the id attribute of the foo element dynamic based on some method parameter:

def makeXML(id: String) = <span id={ id }><strong>Hello,</strong> World!</span>
makeXML("foo")        // => <span id="foo">...</span>

The interpolation syntax is actually fairly generous about what you are allowed to embed. By default, any values within the { ... } markers will first be converted to a String (using its toString method) and then wrapped in a Text before embedding in the XML. However, if the expression within the braces is already of type NodeSeq, the interpolation will simply embed that value without any conversion. For example:

val ns1 = <foo/>
val ns2 = <bar>{ ns1 }</bar>       // => <bar><foo/></bar>
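The default conversion for non-XML values is just as easy to observe (again, my own example):

```scala
val n = 42
val ns = <answer>{ n }</answer>    // 42 is converted via toString
ns.text                            // => "42"
```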

You can even embed something of type Seq[Node] and the interpolation will “do the right thing”, flattening the sequence into an XML fragment which takes the place of the interpolated segment:

val xs = List(<foo/>, <bar/>)
val ns = <baz>{ xs }</baz>          // => <baz><foo/><bar/></baz>

These auto-magical interpolation features are incredibly useful when assembling XML from multiple sources. Their only downside is the fact that the 2.7 version of the Scala IDE for Eclipse really struggles with XML literals, and with interpolated expressions in particular. My recommendation: if you need to work with XML literals, either avoid Eclipse entirely or be careful to wrap all XML literals in parentheses (like this: (<foo><bar/></foo>)). Note that the 2.8 version of the Scala IDE for Eclipse doesn’t impose this requirement.


Of course, creating XML is really only half the story. In fact, it’s actually much less than that. In practice, most XML-aware applications spend the majority of their time processing XML, not synthesizing it. Fortunately, the Scala XML API provides some very nice functionality in this department.

For starters, it is possible to perform XPath-like queries. I say “XPath-like” because it’s really not quite as nice as XPath, nor as full-featured. Sometimes it takes several chained queries to perform the same action as a single, compound XPath query. However, despite its shortcomings, Scala’s XPath support is still dramatically superior to manual DOM walking or SAX handling.

The most fundamental XML query operator is \ (bear in mind that all XML operators are defined on NodeSeq). This operator applies a given String pattern to the direct descendants of the target NodeSeq. For example:

val ns = <foo><bar><baz/>Text</bar><bin/></foo>
ns \ "bar"              // => <bar><baz/>Text</bar>
ns \ "bar" \ "baz"      // => <baz/>

As you can see, the most generic pattern which can be fed into the \ operator is simply the name of the element. All XML operators return NodeSeq, and so it’s very easy and natural to chain multiple operators together to perform chained queries.

However, we don’t always want to chain scores of \ operators together to get at a single deeply-nested element. In this case, we might be better served by the \\ operator:

val ns = <foo><bar><baz/>Text</bar><bin/></foo>
ns \\ "bar"          // => <bar><baz/>Text</bar>
ns \\ "baz"          // => <baz/>

Essentially, \\ behaves exactly the same as \ except that it recurses into the node structure. It will return all possible matches to a particular pattern within a given NodeSeq. Thus, if a pattern matches a containing element as well as one of its children, both will be returned:

val ns = <foo><foo/></foo>
ns \\ "foo"          // => <foo><foo/></foo><foo/>

The NodeSeq returned from the ns \\ "foo" query above actually has two elements in it: <foo><foo/></foo> as well as <foo/>. This sort of recursive searching is very useful for drilling down into deeply nested structures, but its unconstrained nature makes it somewhat dangerous if you aren’t absolutely sure of the depth of your tree. Just as a tip, I generally confine myself to \ unless I know that the node name in question is truly unique across the entire tree.

In addition to simply selecting elements, Scala also makes it possible to fetch attribute values using its XML selectors. This is done by prefixing the name of the attribute with ‘@‘ in the selector pattern:

val ns = <foo id="bar"/>
ns \ "@id"        // => Text(bar)

One minor gotcha in this department: the \ operator always returns something of type NodeSeq. Thus, the result of querying an attribute is actually a NodeSeq containing a Text node. If you want to get a String out of an attribute (and most of us do), you will need to use the text method:

(ns \ "@id").text         // => "bar"

Take care though that your selector is only returning a single Text node, otherwise invoking the text method will concatenate the results together. For example:

val ns = <foo><bar id="1"/><bar id="2"/></foo>
(ns \\ "@id").text          // => "12"

Unlike XPath, Scala does not allow you to query for specific attribute values (e.g. "@id=1" or similar). In order to achieve this functionality, you would need to first query for all id values and then find the one you want:

ns \\ "@id" find { _.text == "1" }        // => Some(Text(1))

Also unlike XPath, Scala does not allow you to query for attributes associated with a particular element name in a single pattern. Thus, if you want to find only the id attributes from bar elements, you will need to perform two chained selections:

ns \\ "bar" \ "@id"

Oh, and one fun added tidbit, Scala’s XML selectors also define a wildcard character, underscore (_) of course, which can be used to substitute for any element name. However, this wildcard cannot be used in attribute patterns, nor can it be mixed into a partial name pattern (e.g. ns \ "b_" will not work). Really, the wildcard is useful in conjunction with a purely-\ pattern when attempting to “skip” a level in the tree without filtering for a particular element name.
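For example (my own snippet), the wildcard lets us skip over an intermediate level of the tree without naming its elements:

```scala
val demo = <a><b><c/></b><d><c/></d></a>

(demo \ "c").length          // => 0 (no *direct* child is named c)
(demo \ "_" \ "c").length    // => 2 (skip one level, then select c)
```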

Despite all of these shortcomings, Scala’s almost-XPath selectors are still very useful. With a little bit of practice, they can be an extremely effective way of getting at XML data at arbitrary tree depths.

Pattern Matching

What nifty Scala feature would be complete without some form of pattern matching? We can match on String literals, Int literals, and List literals; why not XML?

<foo/> match {           // prints "foo"
  case <foo/> => println("foo")
  case <bar/> => println("bar")
}

As we would expect, this code evaluates and prints foo to standard out. Unfortunately, things are not all sunshine and roses. In fact, pattern matching is where Scala’s XML support gets decidedly weird. Consider:

<foo>bar</foo> match {   // throws a MatchError!
  case <foo/> => println("foo")
  case <bar/> => println("bar")
}

The problem is that when we define the pattern, <foo/>, we’re actually telling the pattern matcher to match on exactly an empty Elem with label, foo. Of course, we can fix this by adding the appropriate content to our pattern:

<foo>bar</foo> match {   // prints "foo"
  case <foo>bar</foo> => println("foo")
  case <bar>bar</bar> => println("bar")
}

Ok, that’s a little better, but we rarely know exactly what the contents of a particular node are going to be. In fact, the whole reason we’re pattern matching on this stuff is to extract data we don’t already have, so maybe a more useful case would be matching on the foo element and printing out its contents:

<foo>mystery</foo> match {   // prints "foo: mystery"
  case <foo>{ txt }</foo> => println("foo: " + txt)
  case <bar>{ txt }</bar> => println("bar: " + txt)
}

Ok, that worked, and it used our familiar interpolation syntax. Let’s try something fancier. What if we have text and an element inside our Elem?

<foo>mystery<bar/></foo> match {   // throws a MatchError!
  case <foo>{ txt }</foo> => println("foo: " + txt)
  case <bar>{ txt }</bar> => println("bar: " + txt)
}

Like I said, decidedly weird. The problem is that the txt pattern is looking for one Node and one Node only. The Elem we’re feeding into this pattern has two child Node(s) (a Text and an Elem), so it doesn’t match any of the patterns and throws an error.

The solution is to remember the magic of Scala’s @ symbol within patterns:

<foo>mystery<bar/></foo> match {   // prints "foo: ArrayBuffer(mystery,<bar></bar>)"
  case <foo>{ ns @ _* }</foo> => println("foo: " + ns)
  case <bar>{ ns @ _* }</bar> => println("bar: " + ns)
}

Closer, but still not right. If we were to examine the types here, we would see that ns is actually not a NodeSeq, but a Seq[Node]. This means that even if we weren’t naïvely printing out our match results, we would still have problems attempting to use XML selectors or other NodeSeq-like operations on ns.

To get around this problem, we have to explicitly wrap our results in a NodeSeq using the utility method mentioned earlier:

<foo>mystery<bar/></foo> match {   // prints "foo: mystery<bar></bar>"
  case <foo>{ ns @ _* }</foo> => println("foo: " + NodeSeq.fromSeq(ns))
  case <bar>{ ns @ _* }</bar> => println("bar: " + NodeSeq.fromSeq(ns))
}

Success at last! Now let’s try some attributes. To make things easier, we’ll pattern match on static values rather than trying to actually extract data:

<foo id="bar"/> match {
  case <foo id="bar"/> => println("bar")      // does not compile!
  case <foo id="baz"/> => println("baz")
}

As the comment says, this snippet doesn’t compile. Why? Because Scala doesn’t support XML patterns with attributes. This is a horrible restriction and one that I run up against almost daily. Even from a strictly philosophical sense, pattern matching should be symmetric with the literal syntax (just like List and the :: operator). We’ve already seen one instance of asymmetry in XML pattern matching (child extraction), but this one is far worse.

The only way to pattern match in an attribute-aware sense is to use pattern guards to explicitly query for the attribute in question. This leads to vastly more obfuscated patterns like the one shown below:

<foo id="bar"/> match {       // prints "bar"
  case n @ <foo/> if (n \ "@id" text) == "bar" => println("bar")
  case n @ <foo/> if (n \ "@id" text) == "baz" => println("baz")
}

This situation is also somewhat confusing when attempting to read code which uses pattern matching and branches on attributes. I’m constantly tripping over this when I look back at even my own code, mostly because it looks for all the world like we’re matching on a foo element with no attributes! Very frustrating.

Oh, and one final added goodie: namespaces. Pattern matching on an unqualified element (e.g. <foo/>) will match not only exactly that element name, but also any namespaced permutations thereof:

<w:gadget/> match {       // prints "gadget"
  case <gadget/> => println("gadget")
}

If you want to match a specific namespace, you need to include it in the pattern:

<w:gadget/> match {       // prints "w:gadget"
  case <m:gadget/> => println("m:gadget")
  case <w:gadget/> => println("w:gadget")
}

In practice, this is actually fairly useful, but it’s still another head-scratcher in the Scala XML design. I know I struggled with this as a beginner, and I can’t imagine it’s that much easier for anyone else.

Concurrency Pitfalls

One thing we (at Novell) learned the hard way is that Scala’s XML library is not thread-safe. Yes, XML literals are immutable, but this alone is not sufficient. Even though the API is immutable (doesn’t provide a way to change an XML literal in-place), the underlying data structures are not. Observant readers will have caught this fact from our pattern matching example earlier, when we mistakenly printed “ArrayBuffer(mystery,<bar></bar>)“.

ArrayBuffer is a little like Scala’s answer to Java’s ArrayList. It’s pretty much the de facto mutable Seq implementation. Under the surface, it’s using an asymptotically-growing dynamic array to store its data, providing constant-time read and append. Unfortunately, like all array-based data structures, ArrayBuffer suffers from volatility issues. Unsynchronized use across multiple threads involving mutation (even copy mutation like the ++ method) can result in undefined behavior.

The good news is that this problem is fixed in Scala 2.8. The bad news is that a lot of people are still stuck on 2.7. For now, the only solution is to ensure that you never access a single XML value concurrently. This either requires locking or extra data copying to ensure that no two threads have the same copy of a particular NodeSeq. Needless to say, neither solution is ideal.
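One crude way to do that copying — sketched below under the assumption that a serialize-and-reparse round trip is acceptable for your document sizes; this is my own workaround, not an officially sanctioned one — is to re-parse the serialized form, so that the returned NodeSeq shares no underlying buffers with the original:

```scala
import scala.xml._

// Round-trip through a string so the result is structurally
// independent of the original nodes (and their backing buffers).
def defensiveCopy(ns: NodeSeq): NodeSeq =
  NodeSeq fromSeq XML.loadString(<wrapper>{ ns }</wrapper>.toString).child
```

The copy can then be handed to another thread without the two ever touching the same ArrayBuffer. It is wasteful, but on 2.7 wasteful beats undefined behavior.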


Scala’s XML support is flaky, inconsistent and arguably a bad idea in the first place. However, the fact that it’s already part of the language means that it’s a little late to bring up inherent design flaws. Instead, we should focus on all that’s good about the library, like the convenience of a very straightforward literal syntax and the declarative nature of almost-XPath selectors. I may not like everything about Scala’s XML support — for that matter, I may not like most of Scala’s XML support — but I can appreciate the benefits to XML-driven applications and libraries such as Lift. Hopefully, this brief guide will help you avoid some of the pitfalls and reap the rewards of XML in Scala with a minimum of casualties.

Understanding and Applying Operational Transformation


Almost exactly a year ago, Google made one of the most remarkable press releases in the Web 2.0 era. Of course, by “press release”, I actually mean keynote at their own conference, and by “remarkable” I mean potentially-transformative and groundbreaking. I am referring of course to the announcement of Google Wave, a real-time collaboration tool which has been in open beta for the last several months.

For those of you who don’t know, Google Wave is a collaboration tool based on real-time, simultaneous editing of documents via a mechanism known as “operational transformation”. Entities which appear as messages in the Wave client are actually “waves”. Within each “wave” is a set of “wavelets”, each of which contains a set of documents. Individual documents can represent things like messages, conversation structure (which reply goes where, etc), spell check metadata and so on. Documents are composed of well-formed XML with an implicit root node. Additionally, they carry special metadata known as “annotations” which are (potentially-overlapping) key/value ranges which span across specific regions of the document. In the Wave message schema, annotations are used to represent things like bold/italic/underline/strikethrough formatting, links, caret position, the conversation title and a host of other things. An example document following the Wave message schema might look something like this:

  <body>
    <line/>Test message
    <line/>Lorem ipsum dolor sit amet.
  </body>

(assuming the following annotations):

  • style/font-weight -> bold
  • style/font-style -> italic
  • link/manual ->

You will notice that the annotations for style/font-style and link/manual actually overlap. This is perfectly acceptable in Wave’s document schema. The resulting rendering would be something like this:

Test message

Lorem ipsum dolor sit amet.

The point of all this explaining is to give you at least a passing familiarity with the Wave document schema so that I can safely use its terminology in the article to come. See, Wave itself is not nearly so interesting as the idea upon which it is based. As mentioned, every document in Wave is actually just raw XML with some ancillary annotations. As far as the Wave server is concerned, you can stuff whatever data you want in there, just so long as it’s well-formed. It just so happens that Google chose to implement a communications tool on top of this data backend, but they could have just as easily implemented something more esoteric, like a database or a windowing manager.

The key to Wave is the mechanism by which we interact with these documents: operational transformation. Wave actually doesn’t allow you to get access to a document as raw XML or anything even approaching it. Instead, it demands that all of your access to the document be performed in terms of operations. This has two consequences: first, it allows for some really incredible collaborative tools like the Wave client; second, it makes it really tricky to implement any sort of Wave-compatible service. Given the fact that I’ve been working on Novell Pulse (which is exactly this sort of service), and in light of the fact that Google’s documentation on the subject is sparing at best, I thought I would take some time to clarify this critical piece of the puzzle. Hopefully, the information I’m about to present will make it easier for others attempting to interoperate with Wave, Pulse and the (hopefully) many OT-based systems yet to come.

Operations

Intuitively enough, the fundamental building blocks of operational transforms are operations themselves. An operation is exactly what it sounds like: an action which is to be performed on a document. This action could be inserting or deleting characters, opening (and closing!) an XML element, fiddling with annotations, etc. A single operation may actually perform many of these actions. Thus, an operation is actually made up of a sequence of operation components, each of which performs a particular action with respect to the cursor (not to be confused with the caret, which is specific to the client editor and not at all interesting at the level of OT).

There are a number of possible component types. For example:

  • insertCharacters — Inserts the specified string at the current index
  • deleteCharacters — Deletes the specified string from the current index
  • openElement — Creates a new XML open-tag at the current index
  • deleteOpenElement — Deletes the specified XML open-tag from the current index
  • closeElement — Closes the first currently-open tag at the current index
  • deleteCloseElement — Deletes the XML close-tag at the current index
  • annotationBoundary — Defines the changes to any annotations (starting or ending) at the current index
  • retain — Advances the index a specified number of items

Wave’s OT implementation actually has even more component types, but these are the important ones. You’ll notice that every component has something to do with the cursor index. This concept is central to Wave’s OT implementation. Operations are effectively a stream of components, each of which defines an action to be performed which affects the content, the cursor or both. For example, we can encode the example document from earlier as follows:

  1. openElement('body')
  2. openElement('line')
  3. closeElement()
  4. annotationBoundary(startKeys: ['style/font-weight'], startValues: ['bold'])
  5. insertCharacters('Test message')
  6. annotationBoundary(endKeys: ['style/font-weight'])
  7. openElement('line')
  8. closeElement()
  9. annotationBoundary(startKeys: ['style/font-style'], startValues: ['italic'])
  10. openElement('line')
  11. closeElement()
  12. insertCharacters('Lorem ')
  13. annotationBoundary(startKeys: ['link/manual'], startValues: [''])
  14. insertCharacters('ipsum')
  15. annotationBoundary(endKeys: ['style/font-style'])
  16. insertCharacters(' dolor')
  17. annotationBoundary(endKeys: ['link/manual'])
  18. insertCharacters(' sit amet.')
  19. closeElement()

Obviously, this isn’t the most streamlined way for a human to refer to a document’s content, but a stream of discrete components like this is perfect for automated processing. The real utility of this encoding, though, doesn’t become apparent until we look at operations which encode only a partial document, effectively performing a particular mutation. For example, let’s follow the advice of Strunk and White and capitalize the letter ‘m’ in our title of ‘Test message’. What we want to do (precisely speaking) is delete the ‘m’ and insert the string ‘M’ at its previous location. We can do that with the following operation:

  1. retain(8)
  2. deleteCharacters('m')
  3. insertCharacters('M')
  4. retain(38)

Instead of adding content to the document at every step, most of this operation actually leaves the underlying document untouched. In practice, retain() tends to be the most commonly used component by a wide margin. The trick is that every operation must span the full width of the document. When evaluating this operation, the cursor will start at index 0 and walk forward through the existing document and the incoming operation one item at a time. Each XML tag (open or close) counts as a single item. Characters are also single items. Thus, the entire document contains 47 items.
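To make the item accounting concrete, here is a quick sketch (in Python, purely illustrative; Wave itself is not implemented this way) which models the example document as a flat list of items and checks the numbers quoted above:

```python
# Model the example document as a flat list of "items": each XML open tag,
# each XML close tag, and each character counts as exactly one item.
doc_items = (
    ['<body>']
    + ['<line>', '</line>']                # the empty <line/> element: open + close
    + list('Test message')                 # 12 character items
    + ['<line>', '</line>']
    + ['<line>', '</line>']
    + list('Lorem ipsum dolor sit amet.')  # 27 character items
    + ['</body>']
)
assert len(doc_items) == 47  # the full document spans 47 items

# The capitalisation operation must span the same width: retain(8) plus
# deleteCharacters('m') covers 9 input items, then retain(38) covers the rest
# (insertCharacters consumes no input items).
span = 8 + len('m') + 38
print(len(doc_items), span)  # → 47 47
```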

Our operation above cursors harmlessly over the first eight items (the <body> open tag, the two items of the empty <line/> element, and the five characters of 'Test '). Once it reaches the 'm' in 'message', we stop the cursor and perform a mutation. Specifically, we’re using the deleteCharacters() component to remove the 'm'. This component doesn’t move the cursor, so we’re still sitting at index 8. We then use the insertCharacters() component to add the character 'M' at precisely our current location. This time, some new characters have been inserted, so the cursor advances to the end of the newly-inserted string (meaning that we are now at index 9). This is intuitive because we don’t want to have to retain() over the text we just inserted. We do however want to retain() over the remainder of the document, seeing as we don’t need to do anything else. The final rendered document looks like the following:

Test Message

Lorem ipsum dolor sit amet.
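The cursor walk described above is mechanical enough to sketch in a few lines. The following is a hypothetical Python model which applies an operation to a plain string; retains here count characters only, whereas real Wave operations also traverse XML items and annotation boundaries:

```python
def apply_op(doc, op):
    """Apply an operation (a list of components) to a string document.

    Components are ('retain', n), ('insertCharacters', s) or
    ('deleteCharacters', s); the operation must span the whole document.
    """
    out, cursor = [], 0
    for kind, value in op:
        if kind == 'retain':
            out.append(doc[cursor:cursor + value])
            cursor += value
        elif kind == 'insertCharacters':
            out.append(value)            # the cursor in the *old* doc does not move
        elif kind == 'deleteCharacters':
            assert doc[cursor:cursor + len(value)] == value
            cursor += len(value)
        else:
            raise ValueError(kind)
    assert cursor == len(doc), 'operation must span the full document'
    return ''.join(out)

# Capitalise the 'm' of 'Test message'. Retains count characters only in this
# simplified model, so the numbers differ from the XML-aware example above:
doc = 'Test message'
op = [('retain', 5), ('deleteCharacters', 'm'),
      ('insertCharacters', 'M'), ('retain', 6)]
print(apply_op(doc, op))  # → Test Message
```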

Composition

One of Google’s contributions to the (very old) theory behind operational transformation is the idea of operation composition. Because Wave operations are these nice, full-span sequences of discrete components, it’s fairly easy to take two operations which span the same length and merge them together into a single operation. The results of this action are really quite intuitive. For example, if we were to compose our document operation (the first example above) with our 'm'-changing operation (the second example), the resulting operation would be basically the same as the original document operation, except that instead of inserting the text 'Test message', we would insert 'Test Message'. In composing the two operations together, all of the retains have disappeared and any contradicting components (e.g. a delete and an insert) have been directly merged.

Composition is extremely important to Wave’s OT as we will see once we start looking at client/server asymmetry. The important thing to notice now is the fact that composed operations must be fundamentally compatible. Primarily, this means that the two operations must span the same number of indexes. It also means that we cannot compose an operation which consists of only a text insert with an operation which attempts to delete an XML element. Obviously, that’s not going to work. Wave’s Composer utility takes care of validating both the left and the right operation to ensure that they are compatible as part of the composition process.

Please also note that composition is not commutative; ordering is significant. This is also quite intuitive. If you type the character a and then type the character b, the result is quite different than if you type the character b and then type the character a.
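To see how composition might work mechanically, here is an illustrative Python sketch (not Google’s actual Composer) over a simplified character-only model: walk both component streams in lock-step, treating the first operation’s output as the second operation’s input:

```python
def compose(a, b):
    """Compose operations a then b into one equivalent operation.

    Components are ('retain', n), ('insert', s), ('delete', s); a's output
    width must equal b's input width or the operations are incompatible.
    """
    a = [[k, v] for k, v in a]   # mutable working copies
    b = [[k, v] for k, v in b]
    out = []

    def emit(kind, val):
        if out and out[-1][0] == kind:
            out[-1][1] += val    # merge adjacent components of the same kind
        else:
            out.append([kind, val])

    i = j = 0
    while i < len(a) or j < len(b):
        if i < len(a) and a[i][0] == 'delete':    # deletions in a happen first
            emit(*a[i]); i += 1
        elif j < len(b) and b[j][0] == 'insert':  # insertions in b are new text
            emit(*b[j]); j += 1
        else:
            assert i < len(a) and j < len(b), 'span mismatch: not composable'
            (ak, av), (bk, bv) = a[i], b[j]
            alen = av if ak == 'retain' else len(av)
            blen = bv if bk == 'retain' else len(bv)
            n = min(alen, blen)
            if ak == 'retain' and bk == 'retain':
                emit('retain', n)
            elif ak == 'insert' and bk == 'retain':
                emit('insert', av[:n])
            elif ak == 'retain' and bk == 'delete':
                emit('delete', bv[:n])
            # an insert in a followed by a delete in b cancels out entirely
            a[i][1] = av - n if ak == 'retain' else av[n:]
            b[j][1] = bv - n if bk == 'retain' else bv[n:]
            if alen == n: i += 1
            if blen == n: j += 1
    return [tuple(c) for c in out]

# Composing "insert the whole title" with "capitalise the m" yields a single
# insertion of the already-corrected text; all the retains disappear:
title = [('insert', 'Test message')]
fix = [('retain', 5), ('delete', 'm'), ('insert', 'M'), ('retain', 6)]
print(compose(title, fix))  # → [('insert', 'Test Message')]
```

Note that compose(fix, title) would trip the span assertion, which reflects the ordering-sensitivity described above: composition is only defined when the spans line up, and it is never commutative.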

Transformation

Here’s where we get to some of the really interesting stuff and the motivation behind all of this convoluted representational baggage. Operational Transformation, at its core, is an optimistic concurrency control mechanism. It allows two editors to modify the same section of a document at the same time without conflict. Or rather, it provides a mechanism for sanely resolving those conflicts so that neither user intervention nor locking become necessary.

This is actually a harder problem than it sounds. Imagine that we have the following document (represented as an operation):

  1. insertCharacters('go')

Now imagine that we have two editors with their cursors positioned at the end of the document. They simultaneously insert a 't' and an 'a' (respectively). Thus, we will have two operations sent to the server. The first will retain 2 items and insert a 't'; the second will retain 2 items and insert an 'a'. Naturally, the server needs to enforce atomicity of edits at some point (to avoid race conditions during I/O), so one of these operations will be applied first. However, as soon as either one of these operations is applied, the retain for the other will become invalid. Depending on the ordering, the text of the resulting document will either be 'goat' or 'gota'.

In and of itself, this isn’t really a problem. After all, any asynchronous server needs to make decisions about ordering at some point. However, issues start to crop up as soon as we consider relaying operations from one client to the other. Client A has already applied its operation, so its document text will be 'got'. Meanwhile, client B has already applied its operation, and so its document text is 'goa'. Each client needs the operation from the other in order to have any chance of converging to the same document state.

Unfortunately, if we naïvely send A’s operation to B and B’s operation to A, the results will not converge:

  • 'got' + (retain(2); insertCharacters('a')) = 'goat'
  • 'goa' + (retain(2); insertCharacters('t')) = 'gota'

Even discounting the fact that we have a document size mismatch (our operations each span 2 indexes, while their target documents have width 3), this is obviously not the desired behavior. Even though our server may have a sane concept of consistent ordering, our clients obviously need some extra hand-holding. Enter OT.
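We can reproduce this failed exchange in a simplified character-only model (again, just an illustrative sketch, not Wave’s implementation):

```python
def apply_op(doc, op):
    # Minimal insert/retain applier over a plain string.
    out, cursor = [], 0
    for kind, value in op:
        if kind == 'retain':
            out.append(doc[cursor:cursor + value]); cursor += value
        else:  # 'insert'
            out.append(value)
    assert cursor == len(doc), 'operation does not span the document'
    return ''.join(out)

a = [('retain', 2), ('insert', 't')]  # client A's operation, parented on 'go'
b = [('retain', 2), ('insert', 'a')]  # client B's operation, parented on 'go'

got = apply_op('go', a)  # client A's local state
goa = apply_op('go', b)  # client B's local state
print(got, goa)  # → got goa

# Naively relaying the raw operations cannot work: b was parented on a
# two-item document, but client A's document now spans three items.
try:
    apply_op(got, b)
except AssertionError as err:
    print('cannot apply b to', repr(got), '-', err)
```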

What we have here is a simple one-step diamond problem. In the theoretical study of OT, we generally visualize this situation using diagrams like the following:


The way you should read diagrams like this is as a graphical representation of operation application on two documents at the same time. Client operations move the document to the left. Server operations move the document to the right. Both client and server operations move the document downward. Thus, diagrams like these let us visualize the application of operations in a literal “state space”. The dark blue line shows the client’s path through state space, while the gray line shows the server’s. The vertices of these paths (not explicitly rendered) are points in state space, representing a particular state of the document. When both the client and the server line pass through the same point, it means that the content of their respective documents were in sync, at least at that particular point in time.

So, in the diagram above, operation a could be client A’s operation (retain(2); insertCharacters('t')) and operation b could be client B’s operation. This is of course assuming that the server chose B’s operation as the “winner” of the race condition. As we showed earlier, we cannot simply naïvely apply operation a on the server and b on the client, otherwise we could derive differing document states ('goat' vs 'gota'). What we need to do is automatically adjust operation a with respect to b and operation b with respect to a.

We can do this using an operational transform. Google’s OT is based on the following mathematical identity:

xform(a, b) = (a’, b’), where b’ ∘ a ≡ a’ ∘ b

In plain English, this means that the transform function takes two operations, one server and one client, and produces a pair of operations. These operations can be applied to their counterpart’s end state to produce exactly the same state when complete. Graphically, we can represent this by the following:


Thus, on the client-side, we receive operation b from the server, transform it against a to produce the pair (a’, b’), and then compose b’ with a to produce our final document state. We perform an analogous process on the server-side. The mathematical definition of the transform function guarantees that this process will produce the exact same document state on both server and client.

Coming back to our concrete example, we can finally solve the problem of 'goat' vs 'gota'. We start out with the situation where client A has applied operation a, arriving at a document text of 'got'. It now receives operation b from the server, instructing it to retain over 2 items and insert character 'a'. However, before it applies this operation (which would obviously result in the wrong document state), it uses operational transformation to derive operation b’. Google’s OT implementation will resolve the conflict between 't' and 'a' in favor of the server. Thus, b' will consist of the following components:

  1. retain(2)
  2. insertCharacters('a')
  3. retain(1)

You will notice that we no longer have a document size mismatch, since that last retain() ensures that the cursor reaches the end of our length-3 document state ('got').

Meanwhile, the server has received our operation a and it performs an analogous series of steps to derive operation a’. Once again, Google’s OT must resolve the conflict between 't' and 'a' in the same way as it resolved the conflict for client A. We’re trying to apply operation a (which inserts the 't' character at position 2) to the server document state, which is currently 'goa'. When we’re done, we must have the exact same document content as client A following the application of b’. Specifically, the server document state must be 'goat'. Thus, the OT process will produce the operation a’ consisting of the following components:

  1. retain(3)
  2. insertCharacters('t')

Client A applies operation b’ to its document state, the server applies operation a’ to its document state, and they both arrive at a document consisting of the text 'goat'. Magic!
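The whole resolution can be sketched end-to-end in the simplified character model. The following illustrative transform handles only retains and inserts (a real implementation must also handle deletions, XML items and annotations) and resolves ties in favor of the server:

```python
def apply_op(doc, op):
    # Minimal insert/retain applier over a plain string.
    out, cursor = [], 0
    for kind, value in op:
        if kind == 'retain':
            out.append(doc[cursor:cursor + value]); cursor += value
        else:  # 'insert'
            out.append(value)
    assert cursor == len(doc)
    return ''.join(out)

def xform(client, server):
    """Complete the diamond: returns (client', server') such that applying
    client then server' yields the same document as server then client'.
    Server insertions win ties; deletions are omitted for brevity."""
    a = [[k, v] for k, v in client]
    b = [[k, v] for k, v in server]
    a_out, b_out = [], []

    def emit(out, kind, val):
        # Merge adjacent components of the same kind as we go.
        if out and out[-1][0] == kind:
            out[-1][1] += val
        else:
            out.append([kind, val])

    i = j = 0
    while i < len(a) or j < len(b):
        if j < len(b) and b[j][0] == 'insert':    # the server insert wins the tie
            emit(a_out, 'retain', len(b[j][1]))   # client' must skip over it
            emit(b_out, 'insert', b[j][1]); j += 1
        elif i < len(a) and a[i][0] == 'insert':
            emit(a_out, 'insert', a[i][1])
            emit(b_out, 'retain', len(a[i][1])); i += 1
        else:
            assert i < len(a) and j < len(b), 'span mismatch'
            n = min(a[i][1], b[j][1])             # both components are retains
            emit(a_out, 'retain', n); emit(b_out, 'retain', n)
            a[i][1] -= n; b[j][1] -= n
            if a[i][1] == 0: i += 1
            if b[j][1] == 0: j += 1
    return [tuple(x) for x in a_out], [tuple(x) for x in b_out]

a = [('retain', 2), ('insert', 't')]  # client A's operation, parented on 'go'
b = [('retain', 2), ('insert', 'a')]  # the server-side winner, same parent

a_prime, b_prime = xform(a, b)
print(b_prime)  # → [('retain', 2), ('insert', 'a'), ('retain', 1)]
print(a_prime)  # → [('retain', 3), ('insert', 't')]

# Both sides converge on 'goat':
assert apply_op(apply_op('go', a), b_prime) == 'goat'  # client A applies b'
assert apply_op(apply_op('go', b), a_prime) == 'goat'  # the server applies a'
```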

It is very important that you really understand this process. OT is all about the transform function and how it behaves in this exact situation. As it turns out, this is all that OT does for us in and of itself. Operational transformation is really just a concurrency primitive. It doesn’t solve every problem with collaborative editing of a shared document (as we will see in a moment), but it does solve this problem very well.

One way to think of this is to keep in mind the “diamond” shape shown in the above diagram. OT solves a very simple problem: given the top two sides of the diamond, it can derive the bottom two sides. In practice, we often want only one side of the diamond (e.g. client A only needs operation b’, it doesn’t need a’). However, OT always gives us both pieces of the puzzle. It “completes” the diamond, so to speak.

Compound OT

So far, everything I have presented has come pretty directly from the published whitepapers on Wave’s operational transformation. However, contrary to popular belief, this is not enough information to actually go out and implement your own collaborative editor or Wave-compatible service.

The problem is that OT doesn’t really do all that much in and of itself. As mentioned above, OT solves for two sides of the diamond in state space. It only solves for two sides of a simple, one-step diamond like the one shown above. Let me say it a third time: the case shown above is the only case which OT handles. As it turns out, there are other cases which arise in a client/server collaborative editor like Google Wave or Novell Pulse. In fact, most cases in practice are much more complex than the one-step diamond.

For example, consider the situation where the client performs two operations (say, by typing two characters, one after the other) while at the same time the server performs one operation (originating from another client). We can diagram this situation in the following way:


So we have two operations in the client history, a and b, and only one operation in the server history, c. The client is going to send operations a and b to the server, presumably one after the other. The first operation (a) is no problem at all. Here we have the simple one-step diamond problem from above, and as we know, OT has no trouble at all in resolving this issue. The server transforms a and c to derive operation a’, which it applies to its current state. The resulting situation looks like the following:


Ok, so far so good. The server has successfully transformed operation a against c and applied the resulting a’ to its local state. However, the moment we move on to operation b, disaster strikes. The problem is that the server receives operation b, but it has nothing against which to transform it!

Remember, OT only solves for the bottom two sides of the diamond given the top two sides. In the case of the first operation (a), the server had both top sides (a and c) and thus OT was able to derive the all-important a’. However, in this case, we only have one of the sides of the diamond (b); we don’t have the server’s half of the equation because the server never performed such an operation!

In general, the problem we have here is caused by the client and server diverging by more than one step. Whenever we get into this state, the OT becomes more complicated because we effectively need to transform incoming operations (e.g. b) against operations which never happened! In this case, the phantom operation that we need for the purposes of OT would take us from the tail end of a to the tail end of a’. Think of it like a “bridge” between client state space and server state space. We need this bridge, this second half of the diamond, if we are to apply OT to solve the problem of transforming b into server state space.

Operation Parentage

In order to do this, we need to add some metadata to our operations. Not only do our operations need to contain their components (retain, etc), they also must maintain some notion of parentage. We need to be able to determine exactly what state an operation requires for successful application. We will then use this information to detect the case where an incoming operation is parented on a state which is not in our history (e.g. b on receipt by the server).

For the record, Google Wave uses a monotonically-increasing scalar version number to label document states and thus, operation parents. Novell Pulse does the exact same thing for compatibility reasons, and I recommend that anyone attempting to build a Wave-compatible service follow the same model. However, I personally think that compound OT is a lot easier to understand if document states are labeled by a hash of their contents.

This scheme has some very nice advantages. Given an operation (and its associated parent hash), we can determine instantly whether or not we have the appropriate document state to apply said operation. Hashes also have the very convenient property of converging exactly when the document states converge. Thus, in our one-step diamond case from earlier, operations a and b would be parented off of the same hash. Operation b’ would be parented off of the hash of the document resulting from applying a to the initial document state (and similarly for a’). Finally, the point in state space where the client and server converge once again (after applying their respective operations) will have a single hash, as the document states will be synchronized. Thus, any further operations applied on either side will be parented off of a correctly-shared hash.
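Such a hash-labeling scheme might look like the following sketch (hypothetical, remember: Wave itself uses scalar version numbers, and the helper names here are made up):

```python
import hashlib

def state_hash(doc):
    # Label a document state with a hash of its contents.
    return hashlib.sha256(doc.encode('utf-8')).hexdigest()[:12]

# Each operation carries the label of the state it is parented on:
op_a = {'parent': state_hash('go'),
        'components': [('retain', 2), ('insert', 't')]}

def can_apply(doc, op):
    # An operation applies cleanly iff our current state matches its parent.
    return state_hash(doc) == op['parent']

print(can_apply('go', op_a))   # in sync: safe to apply directly
print(can_apply('goa', op_a))  # diverged: transformation is needed first
```

Because the label is a pure function of the content, two replicas that converge to the same document automatically converge to the same label, no matter which path through state space got them there.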

Just a quick terminology note: when I say “parent hash”, I’m referring to the hash of the document state prior to applying a particular operation. When I say “parent operation” (which I probably will from time to time), I’m referring to the hash of the document state which results from applying the “parent operation” to its parent document state. Thus, operation b in the diagram above is parented off of operation a which is parented off of the same hash as operation c.

Compound OT

Now that our operations have parent information, our server is capable of detecting that operation b is not parented off of any state in its history. What we need to do is derive an operation which will take us from the parent of b to some point in server state-space. Graphically, this operation would look something like the following (rendered in dark green):


Fortunately for us, this operation is fairly easy to derive. In fact, we already derived it and subsequently threw it away! Remember, OT solves for two sides of the diamond. Thus, when we transformed a against c, the resulting operation pair consisted of a’ (which we applied to our local state) and another operation which we discarded. That operation is precisely the operation shown in green above. Thus, all we have to do is re-derive this operation and use it as the second top side of the one-step diamond. At this point, we have all of the information we need to apply OT and derive b’, which we can apply to our local state:


At this point, we’re almost done. The only problem we have left to resolve is the application of operation c on the client. Fortunately, this is a fairly easy thing to do; after all, c is parented off of a state which the client has in its history, so it should be able to directly apply OT.

The one tricky point here is the fact that the client must transform c against not one but two operations (a and b). Fortunately, this is fairly easy to do. We could apply OT twice, deriving an intermediary operation in the first step (which happens to be exactly equivalent to the green intermediary operation we derived on the server) and then transforming that operation against b. However, this is fairly inefficient. OT is fast, but it’s still O(n log n). The better approach is to first compose a with b and then transform c against the composition of the two operations. Thanks to Google’s careful definition of operation composition, this is guaranteed to produce the same operation as we would have received had we applied OT in two separate steps.

The final state diagram looks like the following:


Client/Server Asymmetry

Technically, what we have here is enough to implement a fully-functional client/server collaborative editing system. In fact, this is very close to what was presented in the 1995 paper on the Jupiter collaboration system. However, while this approach is quite functional, it isn’t going to work in practice.

The reason for this lies in that confusing middle part, where the server had to derive an intermediary operation (the green one) in order to handle operation b from the client. In order to do this, the server needed to hold on to operation a in order to use it a second time in deriving the intermediary operation. Either that, or the server would have needed to speculatively retain the intermediary operation when it was derived for the first time during the transformation of a to a’. Now, this may sound like a trivial point, but consider that the server must maintain this sort of information essentially indefinitely for every client which it handles. You begin to see how this could become a serious scalability problem!

In order to solve this problem, Wave (and Pulse) imposes a very important constraint on the operations incoming to the server: any operation received by the server must be parented on some point in the server’s history. Thus, the server would have rejected operation b in our example above since it did not branch from any point in server state space. The parent of b was a, but the server didn’t have a, it only had a’ (which is clearly a different point in state space).

Of course, simply rejecting any divergence which doesn’t fit into the narrow, one-step diamond pattern is a bit harsh. Remember that practically, almost all situations arising in collaborative editing will be multi-step divergences like our above example. Thus, if we naïvely rejected anything which didn’t fit into the one-step mold, we would render our collaborative editor all but useless.

The solution is to move all of the heavy lifting onto the client. We don’t want the server to have to track every single client as it moves through state space since there could be thousands (or even millions) of clients. But if you think about it, there’s really no problem with the client tracking the server as it moves through state space, since there’s never going to be any more than one (logical) server. Thus, we can offload most of the compound OT work onto the client side.

Before it sends any operations to the server, the client will be responsible for ensuring those operations are parented off of some point in the server’s history. Obviously, the server may have applied some operations that the client doesn’t know about yet, but that’s ok. As long as any operations sent by the client are parented off of some point in the server’s history, the server will be able to transform that incoming operation against the composition of anything which has happened since that point without tracking any history other than its own. Thus, the server never does anything more complicated than the simple one-step diamond divergence (modulo some operation composition). In other words, the server can always directly apply OT to incoming operations, deriving the requisite operation extremely efficiently.

Unfortunately, not all is sunshine and roses. Under this new regime, the client needs to work twice as hard, translating its operations into server state space and (correspondingly) server operations back into its state space. We haven’t seen an example of this “reverse” translation (server to client) yet, but we will in a moment.

In order to maintain this guarantee that the client will never send an operation to the server which is not parented on a version in server state space, we need to impose a restriction on the client: we can never send more than one operation at a time to the server. This means that as soon as the client sends an operation (e.g. a in the example above), it must wait on sending b until the server acknowledges a. This is necessary because the client needs to somehow translate b into server state space, but it can’t just “undo” the fact that b is parented on a. Thus, wherever b eventually ends up in server state space, it has to be a descendant of a’, which is the server-transformed version of a. We literally don’t know where b belongs in server state space until we know exactly where a fits in the server’s history.

To help shed some light into this rather confusing scheme, let’s look at an example:


In this situation, the client has performed two operations, a and b. The client immediately sends operation a to the server and buffers operation b for later transmission (the lighter blue line indicates the buffer boundary). Note that this buffering in no way hinders the application of local operations. When the user presses a key, we want the editor to reflect that change immediately, regardless of the buffer state. Meanwhile, the server has applied two other operations, c and d, which presumably come from other clients. The server still hasn’t received our operation a.

Note that we were able to send a immediately because we are preserving every bit of data the server sends us. We still don’t know about c and d, but we do know that the last time we heard from the server, it was at the same point in state space as we were (the parent of a and c). Thus, since a is already parented on a point in server state space, we can just send it off.

Now let’s fast-forward just a little bit. The server receives operation a. It looks into its history and retrieves whatever operations have been applied since the parent of a. In this case, those operations are c and d. The server then composes c and d together and transforms a against the result, producing a’.


After applying a’, the server broadcasts the operation to all clients, including the one which originated the operation. This is a very important design feature: whenever the server applies a transformed operation, it sends that operation off to all of its clients without delay. As long as we can guarantee strong ordering in the communication channels between the client and the server (and often we can), the clients will be able to count on the fact that they will receive operations from the server in exactly the order in which the server applied them. Thus, they will be able to maintain a locally-inferred copy of the server’s history.

This also means that our client is going to receive a’ from the server just like any other operation. In order to avoid treating our own transformed operations as if they were new server operations, we need some way of identifying our own operations and treating them specially. To do this, we add another bit of metadata to the operation: a locally-synthesized unique ID. This unique ID will be attached to the operation when we send it to the server and preserved by the server through the application of OT. Thus, operation a’ will have the same ID as operation a, but a very different ID from operations c and d.

With this extra bit of metadata in place, clients are able to distinguish their own operations from others sent by the server. Non-self-initiated operations (like c and d) must be translated into client state space and applied to the local document. Self-initiated operations (like a’) are actually server acknowledgements of our currently-pending operation. Once we receive this acknowledgement, we can flush the client buffer and send the pending operations up to the server.
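The routing logic might be sketched like this (the names and structure here are made up for illustration; Wave’s actual wire format differs, and the transform step is elided):

```python
import uuid

class Client:
    """Tracks one in-flight operation and a buffer of pending local ops."""
    def __init__(self):
        self.in_flight = None   # the single operation awaiting a server ack
        self.buffer = []        # locally-applied ops not yet sent

    def send(self, components):
        # Attach a locally-synthesized unique ID before sending.
        op = {'id': uuid.uuid4().hex, 'components': components}
        self.in_flight = op
        return op               # ...this would go over the wire here

    def receive(self, op):
        if self.in_flight and op['id'] == self.in_flight['id']:
            # Our own operation, transformed and echoed back: an ack.
            # Flush the buffer; its composition gets sent next.
            self.in_flight = None
            flushed, self.buffer = self.buffer, []
            return ('ack', flushed)
        # Someone else's operation: transform into client space and apply.
        return ('transform-and-apply', op)

client = Client()
sent = client.send([('retain', 2), ('insert', 't')])
client.buffer.append([('retain', 3), ('insert', 's')])

# The server preserves the ID through OT, so the echoed (transformed)
# operation is recognisable even though its components changed:
echoed = {'id': sent['id'], 'components': [('retain', 3), ('insert', 't')]}
other = {'id': uuid.uuid4().hex, 'components': [('retain', 2), ('insert', 'a')]}

print(client.receive(other)[0])   # → transform-and-apply
print(client.receive(echoed)[0])  # → ack
```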

Moving forward with our example, let’s say that the client receives operation c from the server. Since c is already parented on a version in our local history, we can apply simple OT to transform it against the composition of a and b and apply the resulting operation to our local document:


Of course, as we always need to keep in mind, the client is a live editor which presumably has a real person typing madly away, changing the document state. There’s nothing to prevent the client from creating another operation, parented off of c’ which pushes it even further out of sync with the server:


This is really getting to be a bit of a mess! We’ve only sent one of our operations to the server, we’re trying to buffer the rest, but the server is trickling in more operations to confuse things and we still haven’t received the acknowledgement for our very first operation! As it turns out, this is the most complicated case which can ever arise in a Wave-style collaborative editor. If we can nail this one, we’re good to go.

The first thing we need to do is figure out what to do with d. We’re going to receive that operation before we receive a’, and so we really need to figure out how to apply it to our local document. Once again, the problem is that the incoming operation (d) is not parented off of any point in our state space, so OT can’t help us directly. Just as with b in our fundamental compound OT example from earlier, we need to infer a “bridge” between server state space and client state space. We can then use this bridge to transform d and slide it all the way down into position at the end of our history.

To do this, we need to identify conceptually what operation(s) would take us from the parent of d to the most recent point in our history (after applying e). Specifically, we need to infer the green dashed line in the diagram below. Once we have this operation (whatever it is), we can compose it with e and get a single operation against which we can transform d.


The first thing to recognize is that the inferred bridge (the green dashed line) is going to be composed exclusively of client operations. This is logical as we are attempting to translate a server operation, so there’s no need to transform it against something which the server already has. The second thing to realize is that this bridge is traversing a line parallel to the composition of a and b, just “shifted down” exactly one step. To be precise, the bridge is what we would get if we composed a and b and then transformed the result against c.

Now, we could try to detect this case specifically and write some code which would fish out a and b, compose them together, transform the result against c, compose the result of that with e and finally transform d against the final product, but as you can imagine, it would be a mess. More than that, it would be dreadfully inefficient. No, what we want to do is proactively maintain a bridge which will always take us from the absolute latest point in server state space (that we know of) to the absolute latest point in client state space. Thus, whenever we receive a new operation from the server, we can directly transform it against this bridge without any extra effort.

Building the Bridge

We can maintain this bridge by composing together all operations which have been synthesized locally since the point where we diverged from the server. Thus, at first, the bridge consists only of a. Soon afterward, the client applies its next operation, b, which we compose into the bridge. Of course, we inevitably receive an operation from the server, in this case, c. At this point, we use our bridge to transform c immediately to the correct point in client state space, resulting in c’. Remember that OT derives both bottom sides of the diamond. Thus, we not only receive c’, but we also receive a new bridge which has been transformed against c. This new bridge is precisely the green dashed line in our diagram above.

Meanwhile, the client has performed another operation, e. Just as before, we immediately compose this operation onto the bridge. Thanks to our bit of trickery when transforming c into c’, we can rest assured that this composition will be successful. In other words, we know that the result of applying the bridge to the document resulting from c will be precisely the document state before applying e, thus we can cleanly compose e with the bridge.

Finally, we receive d from the server. Just as with c, we can immediately transform d against the bridge, deriving both d’ (which we apply to our local document) as well as the new bridge, which we hold onto for future server translations.
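Continuing the same toy model (again, an illustrative sketch with invented names, not the article's actual implementation), the full bridge walkthrough above reads as: compose local edits onto the bridge, transform incoming server edits against it, and the invariant that the bridge spans from the latest known server state to the client state survives every step:

```python
# Toy OT: operations are lists of (position, char) insertions.
def apply_op(doc, op):
    for pos, ch in op:
        doc = doc[:pos] + ch + doc[pos:]
    return doc

def xf(p, q, p_first):
    (pp, pc), (qp, _) = p, q
    if qp < pp or (qp == pp and not p_first):
        return (pp + 1, pc)
    return (pp, pc)

def transform(client_op, server_op):
    out, srv = [], list(server_op)
    for c in client_op:
        nxt = []
        for s in srv:
            c2 = xf(c, s, True)
            nxt.append(xf(s, c, False))
            c = c2
        srv = nxt
        out.append(c)
    return out, srv

def compose(a, b):
    return a + b

base = "xyz"
bridge = [(0, "a"), (4, "b")]          # a and b, already composed
client = apply_op(base, bridge)        # "axyzb"
server = base

c = [(1, "c")]                         # receive c: transform vs. bridge
bridge, c_p = transform(bridge, c)
client = apply_op(client, c_p)
server = apply_op(server, c)

e = [(6, "e")]                         # perform e: compose onto bridge
client = apply_op(client, e)
bridge = compose(bridge, e)

d = [(0, "d")]                         # receive d: same trick again
bridge, d_p = transform(bridge, d)
client = apply_op(client, d_p)
server = apply_op(server, d)

assert apply_op(server, bridge) == client   # bridge still spans the gap
```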


With d’ now in hand, the next operation we will receive from the server will be a’, the transformed version of our a operation from earlier. As soon as we receive this operation, we need to compose together any operations which have been held in the buffer and send them off to the server. However, before we send this buffer, we need to make sure that it is parented off of some point in server state space. And as you can see by the diagram above, we’re going to have trouble both in composing b and e (since e does not descend directly from b) and in guaranteeing server parentage (since b is parented off of a point in client state space not shared with the server).

To solve this problem, we need to play the same trick with our buffer as we previously played with the translation bridge: any time the client or the server does anything, we adjust the buffer accordingly. With the bridge, our invariant was that the bridge would always be parented off of a point in server state space and would be the one operation needed to transform incoming server operations. With the buffer, the invariant must be that the buffer is always parented off of a point in server state space and will be the one operation required to bring the server into perfect sync with the client (given the operations we have received from the server thus far).

The one wrinkle in this plan is the fact that the buffer cannot contain the operation which we have already sent to the server (in this case, a). Thus, the buffer isn’t really going to be parented off of server state space until we receive a’, at which point we should have adjusted the buffer so that it is parented precisely on a’, which we now know to be in server state space.

Building the buffer is a fairly straightforward matter. Once the client sends a to the server, it goes into a state where any further local operations will be composed into the buffer (which is initially empty). After a, the next client operation which is performed is b, which becomes the first operation composed into the buffer. The next operation is c, which comes from the server. At this point, we must somehow transform the buffer with respect to the incoming server operation. However, obviously the server operation (c) is not parented off of the same point as our buffer (currently b). Thus, we must first transform c against a to derive an intermediary operation, c”, which is parented off of the parent of the buffer (b):


Once we have this inferred operation, c”, we can use it to transform the buffer (b) “down” one step. When we derive c”, we also derive a transformed version of a, which is a”. In essence, we are anticipating the operation which the server will derive when it transforms a against its local history. The idea is that when we finally do receive the real a’, it should be exactly equivalent to our inferred a”.

At this point, the client performs another operation, e, which we immediately compose into the buffer (remember, we also composed it into the bridge, so we’ve got several things going on here). This composition works because we already transformed the buffer (b) against the intervening server operation (c). So e is parented off of c’, which is the same state we would reach were we to apply a” and then the buffer to the server state resulting from c. This should sound familiar. By a strange coincidence, a” composed with the buffer is precisely equivalent to the bridge. In practice, we use this fact to maintain only one set of data, but the process is a little easier to explain when we keep them separate.
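The claim that a” composed with the buffer equals the bridge can be checked directly in the toy model (same illustrative insert-only OT as before; all names are invented for the sketch). Receiving c means transforming it first past the in-flight a (yielding a” and c”), then past the buffer (yielding the new buffer and the c’ we actually apply):

```python
# Toy OT: operations are lists of (position, char) insertions.
def apply_op(doc, op):
    for pos, ch in op:
        doc = doc[:pos] + ch + doc[pos:]
    return doc

def xf(p, q, p_first):
    (pp, pc), (qp, _) = p, q
    if qp < pp or (qp == pp and not p_first):
        return (pp + 1, pc)
    return (pp, pc)

def transform(client_op, server_op):
    out, srv = [], list(server_op)
    for c in client_op:
        nxt = []
        for s in srv:
            c2 = xf(c, s, True)
            nxt.append(xf(s, c, False))
            c = c2
        srv = nxt
        out.append(c)
    return out, srv

def compose(a, b):
    return a + b

base = "xyz"
a = [(0, "a")]                    # sent to the server; now "in flight"
buf = [(4, "b")]                  # b, buffered locally
c = [(1, "c")]                    # arrives from server, parented at base

a2, c2 = transform(a, c)          # a2 is the anticipated a'', c2 is c''
buf2, c_p = transform(buf, c2)    # new buffer, plus c' for our document

# a'' composed with the transformed buffer IS the transformed bridge:
bridge, _ = transform(compose(a, buf), c)
assert compose(a2, buf2) == bridge

client = apply_op(apply_op(base, compose(a, buf)), c_p)   # "axcyzb"
```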

Checkpoint time! The client has performed operation a, which it sent to the server. It then performed operation b, received operation c and finally performed operation e. We have an operation, a” which will be equivalent to a’ if the server has no other intervening operations. We also have a buffer which is the composition of a transformed b and e. This buffer, composed with a”, serves as a bridge from the very latest point in server state space (that we know of) to the very latest point in client state space.

Now is when we receive the next operation from the server, d. Just as when we received c, we start by transforming it against a” (our “in flight” operation). The resulting transformation of a” becomes our new in flight operation, while the resulting transformation of d is in turn used to transform our buffer down another step. At this point, we have a new a” which is parented off of d and a newly-transformed buffer which is parented off of a”.

Finally, we receive a’ from the server. We could do a bit of verification now to ensure that a” really is equivalent to a’, but it’s not necessary. What we do need to do is take our buffer and send it up to the server. Remember, the buffer is parented off of a”, which happens to be equivalent to a’. Thus, when we send the buffer, we know that it is parented off of a point in server state space. The server will eventually acknowledge the receipt of our buffer operation, and we will (finally) converge to a shared document state:
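The entire session can now be replayed end to end in the toy model. This is a sketch under the same assumptions as before (insert-only operations, client-wins tie-breaking, invented helper names); the `recv` helper captures the two-step rule from the text: transform the incoming server operation past the in-flight operation, then past the buffer, and apply what falls out.

```python
# Toy OT: operations are lists of (position, char) insertions.
def apply_op(doc, op):
    for pos, ch in op:
        doc = doc[:pos] + ch + doc[pos:]
    return doc

def xf(p, q, p_first):
    (pp, pc), (qp, _) = p, q
    if qp < pp or (qp == pp and not p_first):
        return (pp + 1, pc)
    return (pp, pc)

def transform(client_op, server_op):
    out, srv = [], list(server_op)
    for c in client_op:
        nxt = []
        for s in srv:
            c2 = xf(c, s, True)
            nxt.append(xf(s, c, False))
            c = c2
        srv = nxt
        out.append(c)
    return out, srv

def compose(a, b):
    return a + b

def recv(server_op, doc, in_flight, buf):
    # Transform the incoming op past the in-flight op, then the buffer.
    in_flight, srv = transform(in_flight, server_op)
    buf, srv = transform(buf, srv)
    return apply_op(doc, srv), in_flight, buf

base = "xyz"
doc = apply_op(base, [(0, "a")])        # perform a; send it to the server
in_flight, buf = [(0, "a")], []

doc = apply_op(doc, [(4, "b")])         # perform b; straight into the buffer
buf = compose(buf, [(4, "b")])

doc, in_flight, buf = recv([(1, "c")], doc, in_flight, buf)   # receive c
doc = apply_op(doc, [(6, "e")])                               # perform e
buf = compose(buf, [(6, "e")])
doc, in_flight, buf = recv([(0, "d")], doc, in_flight, buf)   # receive d

# Server side: transform a past its own history (c then d), apply, ack.
server = apply_op(apply_op(base, [(1, "c")]), [(0, "d")])
a_p, _ = transform([(0, "a")], compose([(1, "c")], [(0, "d")]))
server = apply_op(server, a_p)

assert a_p == in_flight            # the ack a' matches our inferred a''
server = apply_op(server, buf)     # send the buffer; server applies it
assert server == doc               # converged
```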


The good news is that, as I mentioned before, this was the most complicated case that a collaborative editor client ever needs to handle. It should be clear that no matter how many additional server operations we receive, or how many more client operations are performed, we can simply handle them within this general framework of buffering and bridging. And, as when we sent the a operation, sending the buffer puts the client back into buffer mode with any new client operations being composed into this buffer. In practice, an actively-editing client will spend most of its time in this state: very much out of sync with the server, but maintaining the inferred operations required to get things back together again.
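The client's modes described above suggest a small state machine. The following is my own reading of the protocol, with invented state names: at most one operation is ever in flight, edits made while awaiting an acknowledgement are composed into the buffer, and acknowledging a buffered state sends the buffer and drops back to plain awaiting-ack mode.

```python
from enum import Enum, auto

class State(Enum):
    SYNCHRONIZED = auto()           # nothing in flight
    AWAITING_ACK = auto()           # one op sent, empty buffer
    AWAITING_WITH_BUFFER = auto()   # one op sent, further edits buffered

class ClientSync:
    """Sketch of the send/buffer protocol: at most one op in flight."""

    def __init__(self):
        self.state = State.SYNCHRONIZED

    def local_edit(self):
        if self.state is State.SYNCHRONIZED:
            self.state = State.AWAITING_ACK          # send immediately
        else:
            self.state = State.AWAITING_WITH_BUFFER  # compose into buffer

    def ack(self):
        if self.state is State.AWAITING_WITH_BUFFER:
            self.state = State.AWAITING_ACK          # buffer goes out now
        else:
            self.state = State.SYNCHRONIZED
```

As the text notes, an actively-editing client spends most of its time in the buffered state, cycling between sending a composed buffer and accumulating the next one.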


The OT scheme presented in this article is precisely what we use on Novell Pulse. And while I’ve never seen Wave’s client code, numerous little hints in the whitepapers as well as discussions with the Wave API team cause me to strongly suspect that this is how Google does it as well. What’s more, Google Docs recently revamped their word processing application with a new editor based on operational transformation. While there hasn’t been any word from Google on how exactly they handle “compound OT” cases within Docs, it looks like they followed the same route as Wave and Pulse (the tell-tale sign is a perceptible “chunking” of incoming remote operations during connection lag).

None of the information presented in this article on “compound OT” is available within Google’s documentation (unfortunately). Anyone attempting to implement a collaborative editor based on Wave’s OT would have to rediscover all of these steps on their own. My hope is that this article rectifies that situation. To the best of my knowledge, the information presented here should be everything you need to build your own client/server collaborative editor based on operational transformation. So, no more excuses for second-rate collaboration!


  • To obtain Google’s OT library, you must take a Mercurial clone of the wave-protocol repository:

    $ hg clone wave-protocol

    Once you have the source, you should be able to build everything you need by simply running the Ant build script. The main OT classes are org.waveprotocol.wave.model.document.operation.algorithm.Composer and org.waveprotocol.wave.model.document.operation.algorithm.Transformer. Their use is exactly as described in this article. Please note that Transformer does not handle compound OT; you will have to implement that yourself by using Composer and Transformer. Operations are represented by the org.waveprotocol.wave.model.document.operation.DocOp interface, and can be converted into the more useful org.waveprotocol.wave.model.document.operation.BufferedDocOp implementation by using the org.waveprotocol.wave.model.document.operation.impl.DocOpUtil.buffer method.

    All of these classes can be found in the fedone-api-0.2.jar file.

  • Google’s Own Whitepaper on OT
  • The original paper on the Jupiter system (the primary theoretical basis for Google’s OT)
  • Wikipedia’s article on operational transformation (surprisingly informative)