
Infinite Lists for the Finitely Patient

17 Nov 2008

Functional programming has a lot of weird and abstract concepts.  Monads of course are the poster child for all that is strange and confusing in functional languages, but there are other examples.  At first glance, it seems that the concept of an infinite list would be just as bizarre and incomprehensible as anything else in the paradigm, but really things aren’t as bad as all that.  In fact, even the die-hard imperative programmer can likely benefit from the patterns and techniques enabled by this obtuse construct.

For the first part of the article, I’m going to use a lot of examples involving numbers.  This is just because I’m lazy and only want to introduce one concept at a time.  Everyone understands the idea that there are an infinite number of integers starting from 0 and counting upward.  Please bear with me all the way to the end; I promise I will give some more convincing motivation for infinite lists along the way.

If you’re already familiar with the semantics of infinite lists, feel free to skip to the section answering that all-important question: Why Do We Care?

Syntax

Before we dive into the topic of infinite lists and how they are practically applicable, it’s important to understand the underlying mechanics and (critically) how the API works in practice.  I’ll be using Scala to illustrate.  Any language which supports lambdas would actually suffice, but Scala’s unique blend of the functional and the object-oriented can be very useful in illustrating infinite lists in an imperative context.  Also, for the duration of the article, I will be making use of the following implicit conversion:

class RichStream[A](str: =>Stream[A]) {
  def ::(hd: A) = Stream.cons(hd, str)
}
 
implicit def streamToRichStream[A](str: =>Stream[A]) = new RichStream(str)

The only thing you need to know about the above is that it dramatically simplifies the syntax required when working with infinite lists in Scala.  Since the focus of this article is infinite lists and not implicit conversions, we will just take the above as magic incantations and leave it at that.

The primary medium for working with infinite lists in Scala is the scala.Stream class.  This class is not magic; you could write it yourself in plain-old-Scala without too much trouble.  Stream inherits from the all-encompassing scala.Iterable trait, meaning that familiar methods like map, foldLeft and foreach are all available for use with an infinite list as well as a finite one.

In terms of mechanics, infinite lists are very similar to finite ones: composed of a head—which is some data of type A—and a tail—which is another Stream[A].  The critical distinction which allows infinite lists to be infinite is the fact that this tail is lazily evaluated.  That means that the tail as a value is not available until you ask for it.  This is a profound idea with far reaching consequences, particularly in languages which take it to its greatest extreme (Haskell).

Just like with a normal, finite List, we build instances of Stream using a cons cell.  This is basically the in-code representation of the head/tail concept.  Normally, this “cons for Streams” is handled by the singleton method Stream.cons (the equivalent of Clojure’s lazy-cons).  However, thanks to our magic implicit conversion noted above, we can make use of the same right-associative :: operator that we use with regular every-day finite Lists.  For example, we can easily create an infinite list of the natural numbers (from 0 to ∞):

def from(n: Int): Stream[Int] = n :: from(n + 1)
 
val nats = from(0)

In this example, from is what is known as a generator function.  That is to say, it takes an input and transforms it directly into some output structure.  Well, actually that’s only part of the definition, but the rest of it gets a little weird…

One very important property of from which should immediately jump out at you is the fact that it is infinitely recursive.  It takes a number, invokes this mysterious :: operator on that value and then recursively calls back to itself.  There is no conditional guard, no base case, just an endless series of calls.  Intuitively, this should lead to non-termination at runtime…except that it doesn’t.  Remember what I said about the lazily-evaluated tail?  This is where that idea really begins to take effect.  The from function is not infinitely recursive; at least, not right away.

Because the tail of a Stream is lazy, there is no need to actually invoke from(n + 1) until that particular element of the list (or one which comes after it) is required.  To prove this intuition, we can add a side-effect to the from function so that we can trace the execution:

def from(n: Int): Stream[Int] = {
  println("Requesting n = " + n)
  n :: from(n + 1)
}
 
val nats = from(0)

This code will print exactly (and only) the following:

  Requesting n = 0

The invocation from(1) is never processed because it does not need to be.  We can force further evaluation of the list by requesting a later element:

nats(4)      // get the fifth natural number (4)

This will print the following:

  Requesting n = 1
  Requesting n = 2
  Requesting n = 3
  Requesting n = 4

Notice that we don’t evaluate for n = 0 a second time.  Infinite lists are lazy, but not inefficient.  Each element is evaluated exactly once, and only as necessary.  You can exploit this property to do some really amazing things (such as the most concise Fibonacci sequence you will ever see).
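
As a quick illustration of that memoization at work, here is one possible Fibonacci stream written in the same generator style as from (just a sketch; the “most concise” version alluded to above may well look different):

def fib(a: Int, b: Int): Stream[Int] = a :: fib(b, a + b)
 
val fibs = fib(0, 1)
fibs(10)      // 55 -- the first ten elements are computed (and memoized) along the way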

As mentioned previously, all of the standard utility methods available on collections are also usable with infinite lists.  For example, it might be moderately interesting to get a list of all multiples of 5.  We could create a new generator function (e.g. fromBy5) which only returned multiples of five, but there is a much easier technique we could apply.  We already have a list of all natural numbers, why not make use of that?  If we multiplied every natural number by 5, the result would be identical to the infinite list of the multiples of 5.  If you haven’t taken calculus, this statement probably sounds a little weird.  After all, wouldn’t the list of all natural numbers multiplied by 5 be bigger than just the list of the multiples of five?

Actually, the answer to this question is “no”, and we could prove that mathematically if it weren’t entirely beside the point.  The really interesting thing to focus on is the technique suggested: take the infinite list of natural numbers (which we already have) and multiply each element by 5, resulting in another infinite list.  If you’re at all familiar with the higher-order processing of finite collections, you should be thinking map:

val fives = nats map { _ * 5 }

For those of you not “in the know”, the ever-cryptic { _ * 5 } syntax actually means the following: { x => x * 5 }.  The advantage of the former over the latter is merely brevity.

The definition of the map method goes something like this:

For every element within the target collection, apply the specified function value and store the result in the corresponding location of a new collection.

“For every element…”  That phrase should be raising huge red flags.  It implies that map is trying to iterate over the entire nats collection—a collection of infinite size.  Once again, we seem to have created a non-terminating expression.

Fortunately, the laziness of infinite lists comes to our rescue.  Just like :: doesn’t evaluate the tail until it is absolutely required to do so, map doesn’t actually invoke the function it is passed until an element is retrieved from the mapped list.  Once again making use of our println-enabled from method, we can test when the elements are being retrieved:

val nats = from(0)
val fives = nats map { _ * 5 }
 
println("Querying list...")
fives(4)                      // get the fifth multiple of five (20)

When this code executes, it will print the following:

  Requesting n = 0
  Querying list...
  Requesting n = 1
  Requesting n = 2
  Requesting n = 3
  Requesting n = 4

Surprised?  It helps to remember this simple fact: everything about an infinite list is lazy.  Nothing is evaluated until it absolutely must.

So, that means we can do something like the following, right?

for (n <- fives) {
  println(n)
}

If we were to execute this snippet, the result would be as follows:

  Requesting n = 0
  0
  Requesting n = 1
  5
  Requesting n = 2
  10
  Requesting n = 3
  15
  Requesting n = 4
  20
  Requesting n = 5
  25
  Requesting n = 6
  30
  .
  .
  .

In other words, this evaluation really is non-terminating.  This is because there is no way for the laziness of the list to help us.  We’re trying to iterate over every element in the list, retrieving that element for evaluation right away.  In this sort of situation, evaluation cannot be deferred, and thus the program will never end.

As it turns out, there are a number of operations on infinite lists which cannot be done lazily.  The foreach method is one of these, as are foldLeft, foldRight, reduceLeft, reduceRight, length, etc.  As a general rule of thumb, virtually any operation on an infinite list which produces another infinite list can be performed lazily.  Otherwise, if a result is of an immediate type (such as Int or even Array), the evaluation absolutely must be performed at that exact point in the code.  More formally: any operation on an infinite list which is catamorphic (meaning that it takes the whole collection and reduces it to a single value) will force the evaluation of the collection in its entirety.
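
To make the distinction concrete, consider the following illustrative snippet.  The first two lines are perfectly safe because they merely produce new lazy streams; either of the commented-out lines, if uncommented, would try to consume the entire infinite list and never return:

val evens = nats filter { _ % 2 == 0 }       // lazy: produces another infinite list
val doubled = evens map { _ * 2 }            // lazy: elements are computed only on demand
 
// val total = nats.foldLeft(0) { _ + _ }    // catamorphic: would never terminate
// val size = nats.length                    // catamorphic: would never terminate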

So if it’s impossible to iterate, fold or get the length of an infinite list, of what use could they possibly be?  It is true that infinite lists in and of themselves would be too “high-minded” for any practical application, but that’s what the take method is designed to fix.  The Stream#take(Int) method is a way of imposing a narrow window on an infinite list.  Rather than looking at every single natural number from 0 to infinity, we just look at the numbers from 0 to 9.  Put into code:

val nats = from(0)
val digits = nats.take(10)
 
for (d <- digits) {
  println(d)
}

This will print (exactly):

  0
  1
  2
  3
  4
  5
  6
  7
  8
  9

The beauty of this is that we have created an infinite list, taken exactly what we needed and ignored the rest.  Most of the time, this is how you will use the streams which you create.  Note that the take method still does not force evaluation of the list.  If we go back to our fancy printlnified version of from, we can see that evaluation is still being performed lazily, just limited to an upper bound of 9 (because we took the first 10 elements, starting from 0):

  Requesting n = 0
  0
  Requesting n = 1
  1
  Requesting n = 2
  2
  Requesting n = 3
  3
  Requesting n = 4
  4
  Requesting n = 5
  5
  Requesting n = 6
  6
  Requesting n = 7
  7
  Requesting n = 8
  8
  Requesting n = 9
  9
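
One nice consequence (a brief aside): once take has imposed a finite window, the catamorphic operations from earlier become perfectly safe:

val sumOfDigits = nats.take(10).foldLeft(0) { _ + _ }    // 45
val howMany = nats.take(10).length                        // 10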

Why Do We Care?

Thus far, we have looked at a few examples of infinite lists, all of them having to do with numbers.  This list is surprisingly useful (as we will see in a moment), but it is far from the only infinite list we can create.  This section will explore some slightly more down-to-earth applications for infinite lists, including use-cases which everyone (even the dirty-scumbag imperative lovers) can relate to.  :-)

Conceptually, infinite lists are a way of flattening recursion and packaging it up in a form which can be manipulated in a less clumsy fashion.  Got that?  Good.  Now, read it again.

Let’s consider the simple (and extremely common) case in C/C++ where we must iterate over all of the elements of an array.  I’m using C++ here because Java has the enhanced for-loop which completely blows my example out of the proverbial water:

#include <iostream>
using namespace std;
 
int main()
{
    const char *names[6] = { "Daniel", "Chris", "Joseph", "Renee", "Bethany", "Grace" };
    int len = 6;
 
    for (int i = 0; i < len; ++i)
    {
        cout << names[i] << endl;
    }
}

We’ve all written code like this.  In fact, most of us have probably written it so often that the prefix “for (int i = 0; ” has become part of our permanent muscle memory.  However, look at what we have done here.  Strictly speaking, we’re not looping over the array of names at all.  We are actually looping over every successive integer from 0 up to (but not including) 6.

Replicating these semantics in a functional programming language is actually a bit tricky (assuming that we rule out higher-order solutions like foreach).  Ignoring for the moment the issues associated with side-effects and I/O, we have a somewhat more serious issue to worry about: functional languages don’t have any loops.  Of course, Scala does (while and do/while), but for the sake of argument we will ignore this fact.

The idiom for “looping” in functional languages is to use recursion.  Instead of maintaining a mutable counter (i) which changes value at each iteration, we will define a loop method which takes a parameter i, processes the i-th element of the array and then recursively calls itself, passing i + 1.  In code, this looks like the following:

val names = Array("Daniel", "Chris", "Joseph", "Renee", "Bethany", "Grace")
 
def loop(i: Int) {
  if (i < names.length) {
    println(names(i))
    loop(i + 1)
  }
}
loop(0)

Just in case you were confused as to why some people dislike functional languages, the above is “it”.  Scala will compile the above into a form which is actually just as efficient as the earlier C++ example, but that’s not really the point.  The problem is that this code is horribly ugly.  The underlying meaning (looping over the array) is completely obscured behind a fog of syntax and method dispatch.  You can easily imagine how a more complex looping example (particularly with nested loops) might get unmaintainably messy.

This is where infinite lists swoop in and save the day.  Remember when I said that infinite lists serve to bottle up recursion in a form that we can swallow?  (if you don’t remember, scroll back up and read it again)  We can make use of the fact that we have already defined an infinite list of all integers starting at 0.  All we have to do is just take exactly the number we need (names.length) and then use a for-comprehension to iterate over the result:

for (i <- nats.take(names.length)) {
  println(names(i))
}

If you actually unroll the evaluation semantics of the above, you will find that we’re still using recursion, it’s just being handled behind the scenes in the from method.  This is what I meant when I said that infinite lists “flatten recursion”.

Note that the above idiom is so commonplace that Scala actually defines a specialization of infinite lists just to represent ranges of integers.  Oddly enough, this class is called “Range”.  We can create a range by using the until method on a starting integer, passing it an end-point:

for (i <- 0 until names.length) {
  println(names(i))
}

This way, we don’t have to define a new infinite list of nats every time we need to iterate over a range of numbers.
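
Just to convince ourselves that the two forms really do walk the same indices, we can compare them directly (purely illustrative):

(0 until names.length).toList == nats.take(names.length).toList      // true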

Of course, iteration over an integer range is a fairly trite example.  Everyone reading this article should know that there are better ways of accomplishing what we just did.  For example, the higher-order method foreach is capable of printing the names array in a far more concise fashion:

names foreach println

Let’s try something a little more complex.  Imagine that we have to loop through 10 random integers between -100 and 100.  Just to make this really fun, let’s say that we have to perform this as an inner and an outer loop, printing out all combinations of the same 10 random numbers.

If it weren’t for the restriction that the 10 random numbers need be the same between both loops, an imperative programmer might just do something like the following:

java.util.Random rand = new java.util.Random();
 
int a = 0;
for (int i1 = 0; i1 < 10; i1++) {
    a = rand.nextInt(201) - 100;        // generate a random number between -100 and 100
 
    int b = 0;
    for (int i2 = 0; i2 < 10; i2++) {
        b = rand.nextInt(201) - 100;    // generate a random number between -100 and 100
        System.out.println("(" + a + ", " + b + ")");
    }
}

This code is really nasty.  Not only that, but it doesn’t do what we want since the random numbers generated for the inner loop will be different from the ones generated for the outer.  We could technically fiddle with shared seeds and the like, but even that isn’t a really elegant solution.

Using infinite lists, the solution is surprisingly simple.  All we need to do is define an infinite list of random numbers between -100 and 100, then take the first 10 values from this list and perform standard nested iteration using a for-comprehension.  It’s much easier to write in Scala than it is to say:

def randomGen: Stream[Int] = (Math.random * 200).toInt - 100 :: randomGen
val random = randomGen
 
for (a <- random.take(10); b <- random.take(10)) {
  println("(" + a + ", " + b + ")")
}

When this executes, it will print the following sequence:

  (68, 68)
  (68, -83)
  (68, -79)
  (68, -78)
  (68, -18)
  (68, -80)
  (68, 29)
  (68, 10)
  (68, -63)
  (68, 3)
  (-83, 68)
  (-83, -83)
  (-83, -79)
  (-83, -78)
  .
  .
  .

Pretty nifty, eh?  You’ll notice that this time our generator function didn’t need to take a parameter.  This dramatically simplified the syntax involved in generating the infinite list.  However, despite the fact that we no longer require a parameter for our generator, we still need to assign the list to a value.  Consider these two (incorrect!) definitions for random:

def random: Stream[Int] = (Math.random * 200).toInt - 100 :: random
// or even...
val random: Stream[Int] = (Math.random * 200).toInt - 100 :: random

The first definition for random (as a function) will indeed work, producing an infinite list of random numbers.  However, that list will not be repeatable.  In other words, consecutive invocations of random (such as in our nested loop example) will produce different lists.  Swinging to the other end of the scale, the value definition for random will actually produce an infinite list of exactly the same number (e.g. 42, 42, 42, 42, 42, 42, ...).  The reason for this is left as an exercise for the reader.

Let’s consider one last example.  This time, we’ll look at something a little less “functional” in nature.  Imagine that you have defined a database schema which looks like the following:


people
  id         INTEGER
  firstName  VARCHAR
  lastName   VARCHAR
  parentID   INTEGER

This is a fairly simple parent/child schema with each row having a reference to its parent.  It’s not difficult to imagine a scenario where one might want to traverse the “ancestry” of a row in the table, perhaps looking for some specific criterion or just for enumerative purposes.  This could be done in Java (and other languages) by maintaining a mutable “person” variable containing the current row in the database.  This variable could be modified iteratively based on successive queries determining its corresponding parentID.  However, as you can imagine, the code would be messy and prone to subtle errors.

The better approach in this situation is actually to make use of an infinite list.  While it is true that the list could not possibly be infinite given a finite database, the concept of a lazily-constructed collection can still simplify the issue.  We would construct the list using the following generator method:

import java.sql.Connection
 
def ancestors(id: Int): Stream[Int] = {
  val conn: Connection = ...
 
  try {
    val stmt = conn.prepareStatement("SELECT parentID FROM people WHERE id = ?")
    stmt.setInt(1, id)
    val res = stmt.executeQuery()
 
    if (res.next()) {
      val parentID = res.getInt(1)
      parentID :: ancestors(parentID)     // the recursive call is deferred until the tail is forced
    } else Stream.empty
  } finally {
    conn.close()
  }
}

One item of note regarding this function is the use of the Stream.empty constant.  This is precisely analogous to the Nil object used in constructing conventional lists: it simply marks the end of the linked list.  Because every person will have a finite ancestry, we aren’t actually exploiting the infinite potential (no pun intended) of streams.  In fact, we are only using the data structure as a tool to facilitate lazy evaluation.
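
The payoff of all this is that queries are only issued as the ancestry is actually walked.  For example (illustrative only):

val parents = ancestors(42)        // issues only the first query (for the head of the stream)
parents.take(3) foreach println    // issues just enough further queries for three ancestors (assuming the chain is at least that long)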

The nice thing about representing the database traversal in the given fashion is the ability to treat the results as a fully-realized collection even before the queries have actually been run.  For example, we can generate a list of all people in the parentage of person 42, and then restrict this list to only people with an odd id.  I have no idea why this would be useful, but here it is:

val parents = ancestors(42)
val fewerParents = parents filter { _ % 2 == 1 }
 
for (id <- fewerParents) {
  println(nameForPerson(id))       // get a person's name and print
}

All of the logic associated with traversing the database tree is nicely tucked away within the ancestors function.  This separation of logic allows for much cleaner code at times, hence reducing the potential for silly mistakes.

Conclusion

Infinite lists can be a surprisingly powerful tool, even in the arsenal of someone who is still devoted to the Dark Side of imperative programming.  Our last example looked at a scenario where an infinite list could be useful, despite the obviously side-effecting nature of the problem.  Similarly, the example of a list of random numbers shows how infinite lists can even be used to solve problems which are extremely difficult to solve using classical, imperative techniques.

To summarize: infinite lists are not just a beautiful mathematical formalism stemming from the forbidding forests of Haskell, they can also be a useful device when approaching down-to-earth problems like complex iteration.  The next time you find yourself writing unpleasant code to loop over complex data sets, stop and consider whether or not a stream might be a better solution.

Improving the STM: Multi-Version Concurrency Control

10 Nov 2008

I wrote a post some time ago introducing a software transactional memory (STM) framework for Scala based on implicit conversions.  This framework had some nice features, like a very clean syntax coupled with compile-time verification of transactional semantics, but it really wasn’t ready for real-world use.  There was one very serious skeleton in the closet that I carefully avoided as I went through the presentation: non-deterministic behavior was still possible, even within a transaction.

The Problem

Allow me to illustrate.  Imagine two transactions, each executing concurrently in a world with two reference cells.  One transaction reads from both cells, subtracts one from the other and then uses the result as a divisor.  The other transaction increments both cells in lock-step with each other.  Put into code, this would look something like the following:

val a = new Ref(1)
val b = new Ref(0)
 
def compute(x: Int)(implicit t: Transaction) = x / (a - b)
 
def interfere(implicit t: Transaction) {
  a := a + 1
  b := b + 1
}
 
atomic(compute(3)(_))         // in thread-1
...
atomic(interfere(_))          // in thread-2

Incidentally, I know that the whole interfere(_) syntax is a bit ugly.  Voice your opinion on RFE #1492.

This series of computations, while useless, should work just fine under normal circumstances.  The STM is going to verify that neither transaction is stepping on the other with any lasting effect, but unfortunately there are things that can go wrong before we get to “lasting”.  It may not be obvious, but this code actually has a very real probability of throwing an ArithmeticException due to a division by zero.

This unfortunate circumstance arises when the “interfere” transaction commits after the “compute” transaction has read a value for a but before it reads a value for b.  When this occurs, compute will obtain a value for b which is precisely the same as the value it has already read from a.  Now, interfere has also changed the global state of a to preserve its property of leading b by 1, but there is no way for compute to know that: it has already extracted a value for a, it’s not going to check to ensure that it’s still sacred.  What happens next is of course trivial: with a and b having the same value, the subtraction operation results in 0, causing the division to fail.

Note that this issue does not indicate a failure in the safeguards, just an incomplete specification for the STM.  If the reference fault had not resulted in an exception, the commit process for the “compute” transaction would have caught the problem and retried the transaction.  In other words, if the problem hadn’t caused a catastrophic failure within the transaction, the STM would have gracefully recovered and no one would have been any the wiser.

When I created the STM implementation, I recognized that this case can indeed arise.  To compensate for this, I created a dumb catch-all within the transaction dispatch process.  This code literally swallows all exceptions coming from within the transaction, treating them as signals to abort and retry.  This works around our exceptional-interference issue from above, but it does bring consequences of its own:

def badBoy(implicit t: Transaction) {
  throw new RuntimeException("Er...")
}

This transaction will always throw an exception, regardless of the state of the references or any other transactions.  With the originally envisioned design, this exception would have simply propagated out from where the transaction was kicked off (wherever the atomic method was called) and life would have been happy.  Unfortunately, with our catch-all in place (required to solve the earlier problem) this code will fall into an infinite loop.

The problem is that the catch-all will field the RuntimeException and assume that it’s a divine signal that trouble is on the way.  This assumption that data-integrity has been somehow compromised leads the STM to simply retry the transaction.  However, since the exception is thrown every time, the catch-all will just keep on swallowing the same exception, endlessly-retrying the same tired transaction.  Naturally, this is just a trivial example, but it’s not difficult to imagine a block of code which really does throw a legitimate exception, one which we would want to propagate back into user-land where we can effectively deal with the situation.  An infinite loop of failing transactions is not the right way to handle the problem.

Speaking of infinite loops, we still haven’t really solved our interference problem from above.  In our first snippet, we created a data contention situation which could non-deterministically cause an exception to be thrown.  In order to solve this problem, we introduced a catch-all to swallow the exception and just keep right on trucking.  This “solution” not only introduces some undesired behavior (swallowing legitimate exceptions), but it also fails to really fix the issue.  Consider the following slight variation on our previous theme:

val a = new Ref(1)
val b = new Ref(0)
 
def maybeLoop(implicit t: Transaction) {
  while (a <= b) {
    // should never happen
  }
}
 
def interfere(implicit t: Transaction) {
  a := a + 1
  b := b + 1
}

In case it isn’t obvious, the while-loop in maybeLoop should never actually execute.  We have defined references a and b to have values such that a is always strictly greater than b.  The only transaction which modifies these values preserves this property, and so by easy induction we can prove that the conditional “a <= b” is in fact a contradiction.

Unfortunately, our little hand-wavy proof doesn’t take non-determinism into account.  Remember from the first example where the “interfere” transaction committed between the reads for a and b?  Earlier, this just raised an ArithmeticException.  However, in this case, such a conflict would actually cause maybeLoop to suddenly become non-terminating!  Because of the way that this data contention falls out, we could end up with an infinite loop where we had expected no loop at all.

This is very, very bad for a number of reasons.  With our exception problem, we were just able to introduce a catch-all to get around the issue.  The same trick isn’t going to work with an infinite loop.  Control is never returned to the transaction controller, so there is no way to inject any logic in this case.  What’s (possibly) worse is there is no way for us to even try to detect these sorts of situations.  This is actually a slight specialization of the halting problem.  Briefly: because any Turing Complete language must be capable of non-terminating execution, the introduction of non-deterministic evaluation into such a language can result in non-deterministic non-terminating execution.  Since it is impossible in general for one Turing Machine to determine whether or not another will halt on a given input, it must follow that it is also impossible to determine whether or not a non-deterministic Turing Machine will halt.  My logical form is atrocious, but I think you get the picture.

The Solution

The point is that we have a problem, and a cop-out solution like a catch-all exception handler isn’t going to solve it.  The only real answer¹ would be to somehow ensure that when “interfere” commits, it doesn’t actually change the values of a and b as visible to the other running transaction.  In other words, transactional commits should change the global state, but leave any transaction-local state untouched.  That way, in-progress transactions can work with the reference values which were in place at the start of the transaction, deterministically work their way down to the bottom and finally either commit or detect conflicts and retry.  With this solution, exceptions can be propagated out to user-land because every exception will be valid; none will ever be caused by a reference fault.

In a nutshell, this strategy is like giving each transaction its own “snapshot of the world”.  From the moment the transaction starts, it remains entirely oblivious to any changes going on in the world outside.  It’s as if each transaction got to stop the universe, do its work, commit and then allow things to pick up right where they left off.  However, instead of actually stopping the universe, each transaction will preserve all reference state from that exact moment and allow everyone else to keep right on rolling.  In technical terms, this is called MVCC: Multi-Version Concurrency Control.

If we apply this technique to our first dilemma involving compute and interfere, we will find that even when interfere commits between the reads of a and b, there really is no issue.  Rather than getting a new value of b which is inconsistent with its already-retrieved old value of a, the “compute” transaction will just see the old values for both a and b.  None of the changes to the reference world outside of compute are observed until after the transaction completes.  At that point, the STM will detect any reference faults (such as interfere causing interference) and if necessary, retry the transaction.

The Implementation

This all sounds quite well-and-good, but how would we go about creating such a magical fairy land in the sky?  The most intuitive approach would be to copy every live reference whenever we start a transaction.  This isn’t really as bad as it sounds, seeing as any data contained within a reference must be immutable, so copies are literally free (just use the original data).  Unfortunately, that doesn’t stop the entire process from being heinously inefficient.  It’s easy to imagine a system with 100,000 references and a tiny little transaction which only uses 2 of them.  Do we really want to loop through 100,000 references just to preserve a pair of them?

Obviously, a different approach is needed.  That’s where the committal process comes into play.  Under the old design, when a transaction committed, it would first check for any reference faults; then, if everything was dandy, it would write the necessary changes to the global state for each reference.  In code it looked like this:

def commit() = {
  if (world.size > 0) {
    CommitLock.synchronized {
      // check for reference faults: every ref we touched must still be at the revision we recorded for it
      val back = world.foldLeft(true) { (success, tuple) =>
        val (ref, _) = tuple
        success && ref.rev == version(ref)
      }
 
      if (back) {
        // no conflicts: flush our changes (tagged with this commit's revision) to the global state
        for (ref <- writes) {
          ref.contents = (world(ref), rev)
        }
      }
 
      back
    }
  } else true
}

MVCC is going to require this commit method to do a little extra legwork.  Rather than just writing the changes into the global state, commit will need to loop through the active transactions, saving the old values for only the modified references within each of those transactions’ logs.  This dramatically reduces the amount of looping and saving which takes place without actually imposing too much extra overhead.  This change – combined with a separation of reads and writes within a transaction – actually looks like the following:

def preserve[T](ref: Ref[T]) {
  val castRef = ref.asInstanceOf[Ref[Any]]
 
  if (!world.contains(castRef)) {
    val (v, rev) = ref.contents
 
    world(castRef) = v
    version(castRef) = rev
  }
}
 
def commit() = {
  if (world.size > 0) {
    CommitLock.synchronized {
      val f = { ref: Ref[Any] => ref.rev == version(ref) }
      val back = reads.forall(f) && writes.forall(f)
 
      if (back) {
        for (ref <- writes) {
          for (t <- Transaction.active) {
            if (t != this) t.preserve(ref)       // preserve the old contents of t
          }
 
          ref.contents = (world(ref), rev)
        }
      }
 
      back
    }
 
  } else true
}

Amazingly enough, this tiny little change is all that is required to implement MVCC within our STM.  Well, of course I’m skipping the implementation of Transaction.active as well as the bitsy concurrency semantics required to make it all work, but you weren’t really interested in any of that, were you?
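
For completeness, the flip side of preserve is the read path: when a transaction looks up a reference, it should consult its preserved snapshot before falling back to the global state.  The following is only a hypothetical sketch of that idea; the real framework’s read path may be structured quite differently:

def read[T](ref: Ref[T]): T = {
  val castRef = ref.asInstanceOf[Ref[Any]]
 
  if (world.contains(castRef))
    world(castRef).asInstanceOf[T]     // a value preserved for us (or written by us) earlier in the transaction
  else
    ref.contents._1.asInstanceOf[T]    // nothing preserved: the global value is still consistent with our snapshot
}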

For those who are keeping score, Clojure already has MVCC implemented within its hallmark STM.  As a point of interest, the implementation is subtly different from the one I have just outlined.  Instead of placing the old reference values within the local transaction logs, Clojure stores these contents as old versions of the TVar within the reference itself.  I’m not sure if I like this approach better or not, but it certainly seems to work well!  In either case, the semantics are essentially identical.
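
To make that alternative a bit more tangible, here is a rough, hypothetical sketch of a reference which carries its own version history (this is emphatically not Clojure’s actual implementation, just the flavor of the idea):

class VersionedRef[T](initial: T) {
  // newest revision first; old versions stick around until no active transaction can still see them
  private var history: List[(Int, T)] = List((0, initial))
 
  def write(v: T, rev: Int) = synchronized { history = (rev, v) :: history }
 
  // the newest value whose revision is visible to a transaction which started at snapshot `rev`
  def readAt(rev: Int): T =
    history.find { _._1 <= rev }.map { _._2 }.getOrElse(history.last._2)
}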

Final aside worthy of mention: you need not concern yourself that all of this extra copying is going to eat through heap space.  This is one of the huge benefits of immutable data: copies are free.  We aren’t really copying the data, just maintaining a reference to its previous version.  Once the transaction commits, its log will go out of scope and that old reference will be eligible for garbage collection.  All in all, the addition of MVCC to our STM doesn’t raise the overhead even the slightest bit.

Conclusion

Realizing that my blog was not the most convenient way to distribute project files for a semi-serious library like the Scala STM, I decided to put the project on GitHub.  For those who hate following links, you can checkout, build and test (see below) the library using the following commands (assuming you have both Git and Buildr installed):

  git clone git://github.com/djspiewak/scala-stm.git
  cd scala-stm
  buildr

In a flurry of activity besides the addition of MVCC, I have also added some reasonably-comprehensive BDD specs to ensure that the correctness of the implementation isn’t just in my head.  Of course, all of the tests are probabilistic, but I think the library is finally at a point where they can be relied upon.  If you like, I’ll even distribute a proper pre-built JAR.

STM is hardly in its infancy, but it is just now starting to catch on as a viable alternative to the classic locking techniques favored by concurrent programs everywhere.  Improvements like MVCC promise even better reliability and throughput, further advancing the goal of deterministic asynchronous programs which are easy to develop and even easier to reason correct.

¹ There actually is another solution involving conflict detection on the fly, but it’s a little inefficient and not really as subjectively elegant as MVCC.

Scala as a Scripting Language?

3 Nov 2008

I know, the title seems a bit…bizarre.  I don’t know about you, but when I think of Scala, I think of many of the same uses to which I apply Java.   Scala is firmly entrenched in my mind as a static, mid-level language highly applicable to things like large-scale applications and non-trivial architectures, but less so for tasks like file processing and system maintenance.   However, as I have been discovering, Scala is also extremely well suited to basic scripting tasks, things that I would normally solve in a dynamic language like Ruby.

One particular task which I came across quite recently was the parsing of language files into bloom filters, which were then stored on disk.  To me, this sounds like a perfect application for a scripting language.  It’s fairly simple, self-contained, involves a moderate degree of file processing, and should be designed, coded and then discarded as quickly as possible.   Dynamic languages have a tendency to produce working designs much faster than static ones, and given the fact that the use-case required access to a library written in Scala, JRuby seemed like the obvious choice (Groovy would have been a fine choice as well, but I’m more familiar with Ruby).  The result looked something like this:

require 'scala'
 
import com.codecommit.collection.BloomSet
 
import java.io.BufferedOutputStream
import java.io.FileOutputStream
 
WIDTH = 2000000
 
def compute_k(lines, width)
  # ...
end
 
def compute_m(lines)
  #...
end
 
Dir.foreach 'wordlists' do |fname|
  unless File.directory? "wordlists/#{fname}"
    count = 0
    File.open "wordlists/#{fname}" do |file|
      file.each { |line| count += 1 }
    end
 
    optimal_m = compute_m(count)
    optimal_k = compute_k(count, WIDTH)
 
    set = BloomSet.new(optimal_m, optimal_k)
 
    File.new "wordlists/#{fname}" do |fname|
      file.each do |line|
        set += line.strip
      end
    end
 
    os = BufferedOutputStream.new FileOutputStream.new("gen/#{fname}")
    set.store os
    os.close
  end
end

As far as scripts go, this one isn’t too bad.  I’ve written some real whoppers for things like video encoding and incremental backups.  The main trick here is the fact that we need to make two separate passes over the same file in order to get the number of lines before constructing the set.  We could load the file into an array buffer in a single pass, count its length and then iterate over the array, placing each element in the bloom filter.   However, this really wouldn’t be too much faster than just hitting the file twice (we still need two separate passes) and it has the additional drawback of requiring a fair amount of memory.

All in all, this script is a fairly natural representation of my requirements.  I needed to loop over a number of word lists, push the results into separate bloom filters and then freeze-dry the state.  However, look at what we’ve actually done here.  Remember earlier where we were considering which language to use?  We wanted a language which could concisely and quickly express our intent.  For that decision-making process, we just assumed that a dynamic language would serve us better than one hampered by a static type system.  However, at no point in the above script do we actually do anything truly dynamic.  By that I mean: open classes, unfixed parameter types, method_missing, that sort of thing.  In fact, we haven’t really done anything that we couldn’t do in Scala:

import com.codecommit.collection.BloomSet
import java.io.{BufferedOutputStream, File, FileOutputStream}
import scala.io.Source
 
val WIDTH = 2000000
 
def computeK(lines: Int, width: Int) = // ...
 
def computeM(lines: Double) = // ...
 
for (file <- new File("wordlists").listFiles) {
  if (!file.isDirectory) {
    val src = Source.fromFile(file)
    val count = src.getLines.foldLeft(0) { (i, line) => i + 1 }
 
    val optimalM = computeM(count)
    val optimalK = computeK(count, optimalM)
 
    val init = new BloomSet[String](optimalM, optimalK)
 
    val set = src.reset.getLines.foldLeft(init) { _ + _.trim }
 
    val os = new BufferedOutputStream(new FileOutputStream("gen/" + file.getName))
    set.store(os)
    os.close()
  }
}

This is actually runnable Scala.  I’m not omitting boiler-plate or cheating in any similar respect.  If you copy this code into a .scala file and make sure that BloomSet is on your CLASSPATH (which you would have needed anyway for JRuby), you would be able to run the script uncompiled using the scala command.  Unlike Java, Scala actually includes an “interpreter” which can parse raw Scala sources and execute the representative program just as if it had been pre-compiled using scalac.   One of the perquisites of this approach is the ability to simply omit any main method or Application class.  In nearly every sense of the word, Scala is a scripting language…as well as an enterprise-ready Java-killer (let the flames begin).

Now that we’re fairly convinced that the above is valid Scala, let’s compare it with the original version of the script written using JRuby.  If we just go off LoC (Lines of Code), Scala actually wins here.  This was a more-than-slightly surprising discovery for me, given how often dynamic languages (and Ruby in particular) are touted as being more concise and expressive than static languages.  But of course, sheer LoC-brevity isn’t everything: we also should consider things like readability.  A few characters of Befunge can accomplish more than I can do in several lines of Scala, but that doesn’t mean I’ll be able to figure out what it means tomorrow morning.

On the readability score, I think Scala wins here too.  The file processing and set creation is all done in a highly functional style (using foldLeft).   At least to my eyes, this is a lot easier to follow than the imperative form in Ruby.  More importantly, I think it’s a bit harder to make silly mistakes.  When I wrote the Ruby version of the script, it took several tries before I solidly pinned down the exact incantation I was seeking.  The Scala version literally required only one revision after the initial prototype.   Granted, I had the Ruby version to go off of, but I think we would all agree that the scripts use some fairly different libraries and methodologies for accomplishing identical tasks.

So what is it that makes Scala so surprisingly well suited to the task of quick-and-dirty file processing and scripting?  After all, isn’t it just a fancy syntax wrapped around the plain-old-Java standard library?  While it is true that Scala has first-class access to Java libraries (as demonstrated in the script), that isn’t all that it offers.  I believe that Scala has two important features which make it so suitable for these tasks:

  • Type inference
  • Powerful core libraries

The first feature is of course evident wherever you look in the script.  With the exception of the two methods and the BloomSet constructor, we never actually declare a type anywhere in the script.  This gives the whole thing a very “dynamic feel” without actually sacrificing static type safety.   The first time you try this sort of language feature it is an almost euphoric experience (especially coming from highly-verbose languages like Java).

The second feature is a bit harder to see.  It is most evident in the way in which we handle file IO.  The directory listing is of course yet another application of the venerable java.io.File class, but the process of opening and reading the file line-by-line seems to be a lot easier than anything Java can muster.  This is made possible by Scala’s Source API.   Rather than fiddling with BufferedReader and the whole menagerie that goes along with it, we just get a new Source from a File instance and then use conventional Scala methods to iterate over its contents.  In fact, we’re actually applying a functional idiom (fold) rather than a standard imperative iteration.  Finally, when we’re done with our first pass, we don’t need to re-open the file from scratch (inviting initialization mistakes in our coding), we just reset the Source and start from the beginning once more.

Using Scala as a scripting language comes with some pretty hefty benefits.   For one thing, you get immediate and idiomatic access to the mighty wealth of libraries which exist in Java.  Even for scripting, this sort of interoperability is invaluable.  JRuby does provide some excellent Java interop, but it simply can’t compare to what you get with Scala.  Further, Scala has a static type system to check you (at runtime with a script) to ensure that you haven’t done anything obviously bone-headed.  This too is nothing to sniff at.

Given the fact that Scala’s “scripting syntax” is just as concise as Ruby’s (sometimes more), it’s hard to see a reason not to employ it for around-the-server tasks.  Amusingly, the most compelling reason not to use Scala for scripting just might be its comment syntax.  Not having direct support for the magic “hash bang” (#!) incantation to define a file interpreter just means that Scala scripts have to go through some extra steps to be directly executable.  However, if immediately-executable scripts aren’t an issue, you may want to consider Scala as your scripting language of choice for your next non-trivial outing.  You may reap the rewards in ways you weren’t even expecting.

Is Scala Not “Functional Enough”?

20 Oct 2008

In one of Rich Hickey’s excellent presentations introducing Clojure, he mentions in passing that Scala “isn’t really a functional language”.  He says that Java and Scala are both cut from the same mold, and because Scala doesn’t force immutability it really shouldn’t qualify.  This viewpoint is something I’ve been hearing a lot from various sources: people talking about how F# is really the only mainstream functional language, or how once Erlang takes off it will leave Scala in the dust.

When I first heard this sentiment voiced by Rich, I brushed it off as a little odd and only slightly self-serving (after all, if you don’t use Scala, there’s a better chance you will use Clojure).  Rich has his own opinions about a lot of things, but I have found with most that I can still understand his rationale, even if I don’t agree.  So, realizing that many of his other kooky ideas seemed to have some basis in reality, I decided to come back to his opinion on Scala and give it some deeper consideration.

The core of the argument made by Rich (and others) against Scala as a functional language goes something like this:

  • Mutable variables as first-class citizens of the language
  • Uncontrolled side-effects (ties in with the first point)
  • Mutable collections and other imperative libraries exist on equal footing
  • Object-oriented structures (class inheritance, overloading, etc)
  • Verbosity

Comparative Type Inference

If you’re coming from Java-land, the final point may have caught you a bit by surprise.  After all, Scala is vastly more concise than Java, so how could anyone possibly claim that it is “too verbose”?  Well, to answer that question, you have to compare Scala with the other side of the language jungle: the functional languages.  Here’s an explicitly-recursive function which sums a list of integers:

def sum(ls: List[Int]): Int = ls match {
  case hd :: tail => hd + sum(tail)
  case Nil => 0
}

That’s not too bad.  The use of pattern matching eliminates an entire class of runtime errors (selecting a non-existent element) and makes the code a lot cleaner than the equivalent Java.  However, compare this with the same function ported directly to SML (a functional language):

fun sum nil = 0
  | sum (hd :: tail) = hd + sum tail

One thing you’ll notice here is the complete lack of any type annotations.  Like most static functional languages, ML (and derivatives) has a form of type inference called “Hindley – Milner” (sometimes called “global type inference”).  Rather than just looking at a single expression to infer a type (like Scala), Hindley – Milner looks at the entire function and derives the most general (least restrictive) type which satisfies all expressions.  This means that everything can be statically type-checked with almost no need to declare types explicitly.

“Now, wait!” (you say), “You would never write a function just to sum a list; you should be using a fold.”  That’s true.  So let’s see how well these two languages do when the problem is solved in a more realistic fashion.  Once again, Scala first:

def sum(ls: List[Int]) = ls.foldLeft(0) { _ + _ }

Let’s see ML top that!

fun sum ls = foldl (op+) 0 ls

Then again, maybe we’ll just quit while we’re behind…

The fact is that Scala requires significantly more ceremony to accomplish some things which are trivial in pure-functional languages like ML and Haskell.  So while Scala may be a huge step up from Java and C++, it’s still a far cry from being the quickest and most readable way of expressing things in a functional style.

One obvious solution to this would be to just add Hindley – Milner type inference to Scala.  Well, this may be the “obvious” solution, but it doesn’t work.  Scala has an extremely powerful and complex type system, one with a number of properties which Hindley – Milner just can’t handle.  A full object-oriented inheritance hierarchy causes some serious problems with the “most general” inference of Hindley – Milner: just about everything becomes type Any (or close to it).  Also, method overloading can lead to ambiguities in the inferred types.  This is actually a problem even in the venerable Haskell, which imposes hard limitations on what functions can be in scope at any given point in time (so as to avoid two functions with the same name).
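
A tiny illustration of the “everything becomes Any” problem (firstOrElse is just a made-up helper for the sake of the example): with subtyping in the picture, the most general type which unifies two unrelated argument types is often the very top of the hierarchy, which is technically correct but practically useless:

def firstOrElse[A](ls: List[A], default: A): A = ls match {
  case hd :: _ => hd
  case Nil => default
}
 
firstOrElse(List(1, 2, 3), "four")      // compiles happily, but A is inferred as Any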

Simply put, Scala’s design forbids any type inference (that I know of) more sophisticated than local expression-level.  Don’t get me wrong, it’s still better than nothing, but a language with local type inference alone will never be as generally concise as a language with Hindley – Milner.

Side Effects

One big ticket item in the litany of complaints against Scala is the admission of uncontrolled side effects.  It’s not hard to find an example which demonstrates this property:

val name = readLine
println("Hello, " + name)

This example alone shows how fundamental side-effects are within the Scala language.  All we have done here is made two function calls, one of them passing a String and receiving nothing as a result.  From a mathematical standpoint, this code snippet is virtually a no-op.  However, we all know that the println function has an additional side effect which involves sending text to standard out.  Coming from Java, this makes perfect sense and it’s probably hard to see why this would be considered a problem.  However, coming from Haskell, what we just wrote was a complete abomination.

You see, Haskell says that no function should ever have side effects unless they are explicitly declared using a special type constructor.  In fact, this is one of the areas where monads have had a huge impact on Haskell’s design.  Consider the following Haskell equivalent:

main :: IO ()
main = do
         name <- getLine
         putStrLn ("Hello, " ++ name)

Even if you don’t know Haskell, the above should be pretty readable.  The first line is the type declaration for the main “function” (it’s actually a value, but why quibble).  Haskell does have Hindley – Milner type inference, but I wanted to be extra-explicit.  You’ll notice that main is not of type void or Unit or anything similar, it is actually of type IO parameterized with Haskell’s form of Unit: ().  This is an extremely important point: IO is a monad which represents an action with side-effects returning a value which matches its type parameter (in this case, ()).  The little dance we perform using do-notation is just a bit of syntax sugar allowing us to compose two other IO values together in a specific order.  The getLine “function” is of type IO String, meaning that it somehow reads a String value by using side effects (in this case, reading from standard in).  Similarly, putStrLn is a function of type String -> IO ().  This means that it takes a String as a parameter and uses it to perform some side effects, from which it obtains no result value.  The do-notation takes these two monadic values and composes them together, forming one big value of type IO ().

Now, this may seem horribly over-complicated, especially when compared to the nice clean side effects that we have in Scala, but it’s actually quite mathematically elegant.  You see, the IO monad is how we represent actions with side effects.  In fact, the only (safe) way to have side effects in Haskell is to wrap them up inside monad instantiations like these.  Haskell’s type system allows you to actually identify and control side effects so that they remain contained within discrete sections of your code base.

This may not sound so compelling, but remember that functional programming is all about eliminating side effects.  You compute your result, you don’t just accidentally find yourself with a magic value at the end of a long run.  The ability to work with side effects as packaged values just like any other constant is extremely powerful.  More importantly, it is far closer to the “true” definition of functional programming than what we have in Scala.

Conclusion

I hate to say it, but Rich Hickey and the others are quite right: Scala isn’t a terribly functional language.  Variables, mutable data structures, side effects and constant type declarations all seem to conspire to remove that crown from Scala’s proverbial brow.  But let’s not forget one thing: Scala wasn’t designed to be a functional language.

That may sound like heresy, but it’s true.  Scala was created primarily as an experiment in language design, specifically focusing on type systems.  This is the one area where I think Scala excels far beyond the rest of the field.  Scala makes it possible to model many problems in an abstract way and then leverage the type system to prove correctness at compile time.  This approach is both revolutionary and an extremely natural way to solve problems.  The experience of using the type system in this fashion is a little difficult to describe (I’m still on the lookout for good examples), but trust me, you’ll like it when you see it.

Scala’s not really a functional language, and as Cedric Beust has pointed out, it’s not really the best object-oriented language either; so what is it good for?  Scala sits in a strange middle ground between the two worlds of functional and object-oriented programming.  While this does have some disadvantages like being forced to take second place in terms of type inference, it also lets you do some really interesting stuff like build a mutable ListBuffer with constant time conversion to an immutable List, or sometimes recognize the fact that fold is not the universal solution.  It’s an experiment to be sure, but one which I think has yielded some very powerful, very useful results…just not many of a purely functional nature.

Bloom Filters in Scala (and all the fun that they bring)

13 Oct 2008

This is a post I have been meaning to write for quite some time.  Despite being an officially excommunicated former-member of the Ruby community, I still like to keep up on the latest developments.  A few months ago, my Ruby Inside feed produced an interesting post introducing a new library by Peter Cooper (the author of Ruby Inside).  Unlike most “new” Ruby libraries, which are most often just bad clones of Rails snibblets, Peter’s WhatLanguage seemed to be quite unique and interesting.  For the first time in years, I was reading about an innovation in Ruby-land which actually sounded quite exciting.  A good feeling to be sure.

In a nutshell, WhatLanguage (project page at GitHub) is a statistical language identification API which analyses arbitrary input String(s), scoring them against a set of built-in languages (including English, French, Spanish, Russian, and others).  Add to that a little bit of meta-programming magic and the possibilities are eye-catching to say the least:

require 'whatlanguage'
 
'Here is some text to identify'.language       # => :english
'¿Hola, pero donde está su pelo?'.language     # => :spanish

To be honest, I’m not sure what practical application I would have for such a library, but it sure looks like fun!  :-)   Under the surface, the implementation is deceptively simple.  WhatLanguage bundles a set of words for each language it supports.  The implementation then searches these sets for each word in the string, aggregating a score for each language depending on whether or not the word was found.  If exactly one language has a higher score than the others once the end of the string is reached, it is “declared the winner” and returned as a result.  Otherwise, the analysis is undecided and nil is returned.  Simple, right?
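
Translated into the sort of Scala we will be using later on, the scoring idea amounts to something like the following rough sketch (it assumes a BloomSet with a contains-style membership test; WhatLanguage’s real code obviously differs):

// rough sketch only -- assumes BloomSet exposes a `contains` membership test
def score(words: List[String], dict: BloomSet[String]) =
  words.foldLeft(0) { (acc, word) => if (dict contains word) acc + 1 else acc }
 
def winner(words: List[String], dicts: Map[Symbol, BloomSet[String]]): Option[Symbol] = {
  val scored = dicts.toList map { case (lang, dict) => (lang, score(words, dict)) }
  val top = scored.foldLeft(0) { (max, pair) => if (pair._2 > max) pair._2 else max }
 
  scored filter { _._2 == top } match {
    case (lang, _) :: Nil => Some(lang)     // exactly one language with the highest score
    case _ => None                          // a tie: the analysis is undecided
  }
}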

The tricky part is somehow storing and searching those gigantic language word sets in an efficient manner.  As it turns out, this isn’t so easy.  Just stop and consider the number of words which exist in your native tongue.  Unless you speak a very strange, primarily-tonal language, the number is likely in the upper hundreds of thousands.  Just going off of Peter’s word lists, Russian is the real troublemaker, weighing in with a nearly one-million-word vocabulary.  Any attempt to store that much data would require megabytes of memory and several minutes to process, even on a fast machine.  A hash set is out of the question simply because of the sheer amount of contiguous memory it would need to allocate, and can you imagine trying to perform a search traversal on a tree with almost a million nodes?  Obviously, a more effective solution is required.

Probabilistic Sets

So far we have been focusing our imaginary efforts on data structures which guarantee 100% accuracy and full reversibility.  That is to say, data structures which allow you to get out what you put in and which ensure that if something is “in” it will be found and if something is “out” it will not.  After all, is a data structure without these guarantees really all that useful?

As it turns out, the answer is “yes”.  Consider a standard hash set as an example.  Sets make absolutely no guarantee about the ordering of their elements.  If you attempt to iterate over a set, you will eventually reach everything inside, but you cannot rely on a certain element coming before the others.  Of course, the advantage we gain by making this trade-off is constant time searching by item identity.  Lists cannot achieve this performance, neither can arrays.  If you want to search either one of these structures, you must look through every element from start to finish.  So with hash sets, we have effectively traded a fundamental guarantee about our data for efficiency in some other respect.
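
To make the trade-off concrete, here is a trivial comparison (the words are just placeholders): a hash set answers membership questions in effectively constant time but promises nothing about ordering, while a list has to be scanned from front to back.

val dict = Set("apple", "banana", "cherry")
dict.contains("banana")          // effectively constant time; iteration order is unspecified
 
val dictList = List("apple", "banana", "cherry")
dictList.contains("banana")      // linear scan from the head of the list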

The same is true for probabilistic sets.  We don’t actually need all of the properties of a set to implement WhatLanguage.  For example, we don’t need to actually retrieve a word from our dictionary; we only need to discover whether or not that word is in the set.  In fact, we really only need our data structure to have two important properties:

  • Lightning fast boolean searches (does the set contain this element?)
  • Minimal memory footprint

Hash sets can make the first guarantee, but as we have already noticed, they don’t really require any less memory than just storing everything in an array.  Since we can’t afford one million words’ worth of memory per language, we need to come up with some way of storing information about the presence of an element without retaining the data it represents.  In order to do this, we have to make one important sacrifice: accuracy.

We basically need to accept the fact that our data structure may sometimes return inaccurate results.  As it turns out though, this is just fine.  Remember, we’re scoring languages one token at a time; an incorrect answer now and again isn’t going to kill us.  Given a large enough input string, a minimal percentage of inaccuracy will literally be lost in the noise.

As it turns out, there already exists a very old technique for accomplishing exactly these requirements.  Bloom filters have been around since the beginning of time (or at least 1970, when Burton Bloom first described them), but unfortunately very few developers are aware of them.  In a nutshell, they precisely satisfy all of the needs we just finished enumerating.  They provide true (not amortized) constant-time searches and will consistently return true or false for a given element, depending on whether or not the value is in the set.  Occasionally they will give false positives (claiming that an element is in the set when in fact it is not), but never false negatives.  This property is extremely useful in many applications, but it turns out that this is even more than we need for WhatLanguage.

Most importantly though, bloom filters are extremely compact in memory.  Rather than storing an entire value, they hash every element to one or (usually) more indexes in an array of bits (0 or 1).  In an optimal bloom filter guaranteeing no false positives, each element would correspond to precisely k bits of memory (where k is the number of hash functions, usually in the vicinity of 3 or 4).  Once this hashing is complete, the value itself may be discarded.  Thus, assuming an optimal bloom filter with 3 hash functions, the storage requirements for one million words would only be 375 KB of memory.  Compare this to the 8 MB required just to store the pointers to each of the words, much less the values themselves!  If we include the underlying character values, we start pushing closer to 20 MB.
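
Just to sanity-check those figures, here is the back-of-the-envelope arithmetic (assuming an ideal filter with k = 3 and 64-bit pointers; the names are only illustrative):

val words = 1000000
val k = 3
 
val filterBytes  = words * k / 8    // 3,000,000 bits => 375,000 bytes, roughly 375 KB
val pointerBytes = words * 8L       // 8,000,000 bytes of references alone, before any character data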

375 KB for a million elements is impressive, but bloom filters can actually do a lot better than that.  Remember, we agreed to accept some false positives.  It’s easy to imagine how even an ideal bloom filter (such as the one we have been analyzing) could have some inaccuracy, but if we’re willing to live with a few more false positives, we can allow hash collisions and drop our memory requirements into the 200 KB range.  Not bad, considering how much data we’re storing!

Bloom Filters

So how does the magic actually work?  We’ve hinted at the implementation in our analysis of memory requirements, but we’re still missing some important details.  Consider the following diagram (courtesy of Wikipedia):

Under the surface, a bloom filter is just an array of boolean values.  The length of the array is referred to as the width of the bloom filter.  When the filter is empty, all of the array values are set to false.  To insert an element, we compute k different hash indexes in the array and set them all to true.  The real beauty of the data structure is the manner in which we handle collisions: we don’t worry about it.  Whenever one (or more) of the hash functions returns an index which is already true, we simply leave that index as-is and move on.  Thus, the procedure for inserting an element into a “full” bloom filter is precisely the same as inserting into an empty one: set all of the hash indexes to true.  In a sense, bloom filters can never be full, they just get less and less accurate.

Bloom filters define only one operation other than insertion: searching.  To determine whether or not a value is in the set, we run all of our hash functions and find the indexes to which it corresponds.  If all of these indexes are true, then we can assume that the value is indeed in the set.  Conversely, if any one of the indexes is false, then the value cannot be in the set, since we would have set that index to true upon insertion.

It is important to note that removal is not supported by bloom filters.  Intuitively we could just take the value to be removed, calculate its hash indexes and then set the corresponding elements to false.  However, if any other values also hash to even one of those locations, we will be effectively removing it from the bloom filter “by accident”.  Subsequent searches for that value will hash to the index which we erroneously set to false, resulting in a false negative on the search (saying that an element is not in the set when in fact it is).  Fortunately for us, WhatLanguage only requires insertion and searching for values, removal is not on the agenda.
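
Before we worry about immutability, the description above can be boiled down to a throw-away mutable sketch.  This is purely for illustration (it is not the real implementation, and it assumes a hash function that returns non-negative values); the functional version follows in the next section.

// A naive, mutable sketch of the idea: an array of booleans plus k hash functions
class NaiveBloom(width: Int, k: Int, hash: (Any, Int) => Int) {
  private val bits = new Array[Boolean](width)    // everything starts out false
 
  def insert(e: Any) {
    for (i <- 0 until k)
      bits(hash(e, i) % width) = true             // collisions are silently ignored
  }
 
  def query(e: Any) =
    (0 until k) forall { i => bits(hash(e, i) % width) }   // every index must be set
}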

Scala Implementation

Well, this has probably been a nostalgic (for those of you who enjoyed your university days) stroll through the land of data structure theory, but we’re still lacking one critical factor: a concrete implementation.  Bloom filters don’t sound that complicated on paper, but to do anything useful with them we’re going to need some code.

Before we dive in, we might want to define some more precise requirements for our implementation.  We’re writing this in Scala, so it would be nice to implement bloom filters as a functional data structure.  I’m not going to get into all the benefits of immutable data structures; suffice it to say, I really think that they’re the only way to go for most serious applications.  Unfortunately, in order to do this, we’re going to need some sort of immutable, array-like data structure under the surface which provides random access in constant time.  Sound familiar?

This is actually where the whole saga of persistent vectors in Scala found its roots.  The first time I tried to implement a functional bloom filter in Scala I made use of a conventional array under the surface with copy-on-write semantics.  As you can imagine, this is hideously inefficient in both space and time.  In order to solve the problem, I had to implement a data structure which was just as efficient for reading and writing, but which also shared structure with subsequent versions.  Now, two months later, I’ve borrowed just such a design from Clojure and created a first-class implementation in Scala.

With a persistent vector now at our disposal, we can easily implement a persistent bloom filter by using a vector to represent our boolean indexes.  Every time we perform an insertion, we will “modify” the relevant indexes, creating a new instance of the vector.  This vector can then be used in our “modified” bloom filter, returned as a new instance of our container class.  Since persistent vectors share unmodified structure between versions, so does our bloom filter.  All of the hard work of efficient immutability is already done for us, we just need to implement a thin facade.  We start out by defining the class skeleton:

import BloomSet._
 
class BloomSet[A] private (val size: Int, val k: Int, 
                           private val contents: Vector[Boolean]) extends ((A)=>Boolean) {
  val width = contents.length
 
  def this(width: Int, k: Int) = this(0, k, alloc(width))
}
 
object BloomSet {
  def alloc(size: Int) = {
    (0 until size).foldLeft(Vector[Boolean]()) { (c, i) => c + false }
  }
}

We’ll keep all of our utility functions in the companion object.  There’s really no need to put them in the main class, and the JVM is able to optimize dispatch to static members to a slightly higher degree than it can instance members (invokestatic).  Technically, we don’t really need the alloc function, it just hides some of the boiler-plate involved in initializing a vector to a certain length.  Note that this operation is extremely fast, even for high values of width.  Despite the fact that we are creating a huge number of Vector objects, very little memory is actually wasted.  Even at this early stage in the implementation, the advantages of persistence are evident.

A Little Math

You’ll notice that our bloom filter has three major properties: size, k, and width.  As previously mentioned, k is the number of hash functions to be used, while width is the length of the array used to store the boolean values.  These values have to be user-configurable because different applications will call for different optimizations.  For example, implementing WhatLanguage requires the storage of almost a million elements per set.  Obviously we would not want to attempt to use a bloom filter which only had a width of (say) 200.  This wouldn’t cause any errors per se (remember, bloom filters are never “full”), but the accuracy of such a filter would be almost nil.  With a million elements all hashed into a mere 200 indexes, it is likely that every single index would have a value of true.  Such a bloom filter would return true when searching for any value, including those which were not inserted.  We said we would tolerate some false positives, but we still want the results to be meaningful.  Since the width of a bloom filter cannot be changed once data is inserted (at least, not without “losing” data), the user must be able to decide in advance how much space will be required and then instantiate BloomSet accordingly.
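
In terms of the class skeleton above, that configuration is nothing more than the constructor arguments; something like the following (the numbers here are purely illustrative):

val dictionary = new BloomSet[String](2000000, 4)   // sized generously for roughly a million words
val hopeless   = new BloomSet[String](200, 4)       // would answer true for nearly everything once populated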

The value of k is an even more interesting proposition.  The intuition is that the larger our value of k (thus, the more hash functions we use), the greater the accuracy of our bloom filter and the fewer false positives we will return.  This is technically true when the bloom filter does not contain very many elements, but as more and more values are inserted, higher values of k become problematic.  Wikipedia has an excellent (and extremely readable) mathematical analysis of this and other properties of bloom filters.  As it turns out, the optimal value for k (one which will yield the fewest false positives) can be computed by the following expression (where m is the width and n is the size, or the number of elements contained in the filter):

k = \ln(2) \, \frac{m}{n}

For the sake of quick-computation, this is essentially equivalent to:

k \approx \frac{9m}{13n}

The equivalence isn’t exact (ln 2 ≈ 0.6931, while 9/13 ≈ 0.6923), but since we’re already rounding the result to the nearest integer, we don’t need to be too concerned.

While it would be nice if we could just auto-magically set k to the optimum value and be done with it, that just isn’t the case.  Like width, k cannot be changed after data has been inserted.  Unfortunately, its optimum value is dependent on the number of items which will eventually be in the set.  Since we don’t really “know” this value within the data structure until after the insertions are complete, there is no way for us to self-optimize.  To provide maximum configurability, we need to allow this value to be set by the API consumer on a case-by-case basis.

As an interesting aside, we can actually compute the accuracy of a bloom filter on a scale from 0 to 1 given values for width, k and size.  In this case, accuracy is defined as the odds that a search result is indeed accurate (the chance of it not being a false positive).  Wikipedia gives the following expression for this accuracy:

1 - (1 - e^{-kn/m})^k

Thanks to the marvels of approximation, this is a fairly efficient computation on modern platforms.  Translated into Scala, it looks something like this:

lazy val accuracy = {
  val exp = ((k:Double) * size) / width
  val probability = Math.pow(1 - Math.exp(-exp), k)
 
  1d - probability
}

Core Functionality

As interesting as this minor detour has been, we still haven’t arrived at a working implementation of the bloom filter data structure.  We have the basic structure in place, but we still need to implement the two core operations: insert and search, rendered in Scala as + and contains:

def +(e: A) = new BloomSet[A](size + 1, k, add(contents)(e))
 
def contains(e: A) = {
  (0 until k).foldLeft(true) { (acc, i) => 
    acc && contents(hash(e, i, width)) 
  }
}
 
protected def add(contents: Vector[Boolean])(e: Any) = {
  var back = contents
 
  for (i <- 0 until k) {
    back = back(hash(e, i, width)) = true
  }
 
  back
}

Feels sort of anti-climactic, doesn’t it?  After all that analysis, we find that it is possible to implement the essence of a bloom filter in less than 17 lines of code.  Technically, I probably could trim this down to more like 8 lines, but this isn’t an exercise in Scala golf.  These three methods are just the materialization of all of the hand-wavy descriptions that we’ve spent the last five pages working through.  Each time we insert an element, we calculate k different hash values and set the appropriate vector indexes to true.  To search for that element, we calculate the same k hashes and check that all of the corresponding indexes are true.
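
Usage is exactly what you would expect from an immutable collection; something along these lines (the width and k here are chosen arbitrarily):

val empty  = new BloomSet[String](1024, 3)
val filled = empty + "chat" + "chien"
 
filled contains "chat"       // => true
filled contains "cheval"     // => false (barring an unlucky false positive)
empty contains "chat"        // => false; the original instance is untouched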

Hashing

Those of you who are actually reading the code snippets rather than just blindly trusting my veracity will probably notice that I’ve omitted a fairly important function from the above sample; specifically: hash.  Somehow, we need to define a function which will simulate the existence of a potentially unlimited number of hash functions.  Based on our usage of this function, we need to be able to grab a hash value for a given value and function index, restricted to a given width.  This is not a trivial problem.

As a good developer, you should already be trying to think up ways to reduce the code involved in implementing these requirements.  I mean, we could potentially just dream up a few dozen different hash functions with a fairly wide distribution of results.  However, I don’t know about you, but I really don’t have that much imagination.  I have a hard enough time developing just one solid hash function, much less dozens.  What’s more, if each hash function corresponds with a different method, then we can’t really scale the number of hash functions to any integer k.

A better approach would be to create a single hash function and then skew its value repeatedly, once for each n between 0 and k.  Like most things that have to do with hashing, it’s cheating, but provided we do our job right, it should give us a fairly decent way of computing the required k hash functions.  Of course, the key phrase there is “do our job right”.  We need to make sure that we skew the initial hash value sufficiently for each successive hash, otherwise successive hash functions might get “stuck in a groove”.  For example, let’s assume that each successive hash function is equivalent to the previous hash + 2.  If we then inserted two elements into the bloom filter with k = 3, there would be the potential that hash(0) for the first element would be precisely 2 greater than hash(0) for the second.  If this were the case, then hash(1) for the second would equal hash(0) for the first.  Thus, the first and second elements would collide on two out of three indexes.  Not exactly an optimal distribution.

To avoid this problem, we should be sure to skew the hash value by a different amount for each hash function.  A convenient way of doing this is to use the number of iterations, or the index of the particular hash function.  To further improve the distribution, we will also use the bitwise XOR operation rather than simple addition.  It’s not really intuitively obvious, but XOR is mathematically far superior for hashing and just as cheap in terms of processor cycles.  Put it all together, and the implementation looks something like this:

def hash(e: Any, iters: Int, bounds: Int): Int = {
  Math.abs(
    if (iters == 0) e.hashCode 
    else iters ^ hash(e, iters - 1, bounds)
    ) % bounds
}

The final bit of math in this function mods our skewed hash value with the given bounds.  Whenever we call hash, we always pass the width of the bloom filter.  This is a trick that you’ll see a lot in any hash-based data structure.  Essentially, we’re just mapping our hash value into the required domain.  Hash values can be anything between -2³¹ and 2³¹ - 1, but our vector indexes can only be between 0 and width - 1.  Just a rule of thumb, but any time you have a problem where you must “map values from a large domain to a more restricted one”, mod is generally the correct approach.

Observant readers will notice that our implementation of hash is a little inefficient.  Actually, it’s worse than just a “little” inefficient: it’s downright terrible.  Remember that we have to call this method once for every integer n between 0 and k.  This isn’t so bad, since k is usually comparatively small, but the real problem is our sneaky use of recursion within hash.  The time complexity of hash is precisely O(n), where n is the number of iterations.  Factor this in with the number of times we call the function and it is easy to see that the total complexity of computing all k hash values is actually O(k²).  Thus, for k = 4, we end up invoking hash ten times (1 + 2 + 3 + 4) rather than just four, and the imbalance only gets worse for higher values of k.

Technically, we could rewrite add and contains to compute the hash values incrementally from the bottom up, which would bring the hashing complexity down to O(k).  However, the recursive version is a bit more elegant in terms of code, not to mention easier to talk about in a modular fashion.  In practice, the inefficiency is only a problem for higher values of k, since computing a single-step hash value is a very fast operation.  We will run into trouble in a little bit though, when we attempt to optimize the bloom sets for the shorter wordlists in WhatLanguage.
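
For the curious, a bottom-up rewrite of contains might look something like the following sketch (add would be reworked along the same lines).  The idea is simply to carry the previous single-step hash forward rather than recomputing it recursively each time; it assumes k ≥ 1.

def contains(e: A) = {
  var h = Math.abs(e.hashCode) % width      // equivalent to hash(e, 0, width)
  var result = contents(h)
 
  var i = 1
  while (i < k) {
    h = Math.abs(i ^ h) % width             // one O(1) step per additional hash function
    result = result && contents(h)
    i += 1
  }
 
  result
}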

Serialization

One of the important requirements of WhatLanguage that we haven’t really touched on so far is the ability to store freeze-dried bloom filters on disk and then suck them back into memory quickly and efficiently.  If you think about it, this requirement only makes sense since one of our primary motivations from the start was to avoid the burdens associated with huge word lists.  After all, if our in-memory representation of a language is only a few hundred kilobytes, shouldn’t the on-disk representation be no worse?  Furthermore, if we had to parse a wordlist into a bloom filter every time we wanted to check a string’s language, the performance penalty would be utterly prohibitive.

Fortunately, there is a fairly obvious format which we could use to efficiently serialize an arbitrary bloom filter.  Recall that the real heart of a bloom filter is the array of boolean flags, an array which can also be represented as an array of bits.  We could very easily select these bits eight at a time, combine them together using bit-shifting and produce a byte value which could be sent to an OutputStream.  Thus, a bloom filter with an array of {true, false, true, true, true, false, false, true, false} would be stored as the following series of bits: 101110010, which in hexadecimal byte form becomes B9 00.

In addition to the array, we also need to store the size, k, and width of the bloom filter, otherwise it will be impossible to find anything reliably in the deserialized result.  The Scala store and load methods to accomplish this are verbose, but fairly straightforward:

class BloomSet[A] ... {
  ...
 
  def store(os: OutputStream) {
    os.write(convertToBytes(size))
    os.write(convertToBytes(k))
    os.write(convertToBytes(contents.length))
 
    var num = 0
    var card = 0
    for (b <- contents) {
      num = (num << 1) | (if (b) 1 else 0)    // construct mask
      card += 1
 
      if (card == 8) {
        os.write(num)
 
        num = 0
        card = 0
      }
    }
 
    if (card != 0) {
      os.write(num)
    }
  }
}
 
object BloomSet {
  def load[A](is: InputStream) = {
    val buf = new Array[Byte](4)
 
    is.read(buf)
    val size = convertToInt(buf)
 
    is.read(buf)
    val k = convertToInt(buf)
 
    is.read(buf)
    val width = convertToInt(buf)
 
    var contents = Vector[Boolean]()
    for (_ <- 0 until (width / 8)) {
      var num = is.read()
      var buf: List[Boolean] = Nil
 
      for (_ <- 0 until 8) {
        buf = ((num & 1) == 1) :: buf
        num >>= 1
      }
 
      contents = contents ++ buf
    }
 
    if (width % 8 != 0) {
      var buf: List[Boolean] = Nil
      var num = is.read()
 
      for (_ <- 0 until (width % 8)) {
        buf = ((num & 1) == 1) :: buf
        num >>= 1
      }
 
      contents = contents ++ buf
    }
 
    new BloomSet[A](size, k, contents)
  }
 
  ...
}

It is interesting to note that store is a good example of what David MacIver meant when he said that functional code ≠ good code.  I actually tried to write this in a functional style to start with, but I gave up after it became horribly ugly (having four nested folds is never a good sign).  The imperative rendering (in this case) is concise and elegant, one of the many advantages of a hybrid language like Scala.

This particular format for serializing a bloom set is almost identical to the format used by Peter Cooper’s Bloomin’ Simple, the library which sits underneath WhatLanguage.  However, due to differences in hashing, it is impossible to use his serialized bloom filters in Scala, or vice versa.

WhatLanguage in Scala

So now that we have our magic bloom filter ready and waiting, it’s time to take a crack at Peter’s fancy language identifier!  As it turns out, this is a lot simpler than the bloom filter itself.  Conceptually, all we have to do is load a different bloom filter for each language and then use that bloom filter to check each input string token individually.  For the sake of uniformity, we will convert all tokens to lower-case.

Of course, before we can load the language bloom filters we need to first store them, and we cannot store a bloom filter before it is generated.  To that end, we need to create a simple Scala script (yes, a script) which reads in a file of newline-delimited words, inserts them all into an instance of BloomSet and then stores the result in a corresponding language file.  Scala’s scripting functionality is actually quite good, especially for a statically typed language, and really deserves some better recognition.  The complete language generator script is as follows:

import com.codecommit.collection.BloomSet
import java.io.{BufferedOutputStream, File, FileOutputStream}
import scala.io.Source
 
val WIDTH = 2000000
 
def computeK(lines: Int) = (((9:Double) * WIDTH) / ((13:Double) * lines)).intValue
 
for (file <- new File("wordlists").listFiles) {
  if (!file.isDirectory) {
    println(file.getName)
    println("==========================")
 
    val src = Source.fromFile(file)
    val count = src.getLines.foldLeft(0) { (i, line) => i + 1 }
    println("  Word count: " + count)
 
    val optimalK = computeK(count)
    val init = new BloomSet[String](WIDTH, Math.min(optimalK, 100))
 
    println("  Optimal K: " + optimalK)
    println("  Actual K: " + init.k)
 
    val set = src.reset.getLines.foldLeft(init) { _ + _.trim.toLowerCase }
 
    println("  Accuracy: " + set.accuracy)
 
    val os = new BufferedOutputStream(
        new FileOutputStream("src/main/resources/lang/" + file.getName))
    set.store(os)
    os.close()
 
    println()
  }
}

I originally created this script using JRuby, thinking of course that it would be much easier to perform a simple, one-off task like this in a dynamically typed language.  Interestingly enough, the JRuby version was about twice as long and actually took dramatically longer to run.  By “dramatically” I mean on the order of seventy times longer.  The above Scala script takes just over 30 seconds to run on my machine using the wordlists from WhatLanguage.  This stands in stark contrast to the 35 minutes required for the JRuby script.  Both scripts use the same underlying data structure to perform all the work (BloomSet), so the difference cannot be blamed on the data structure itself.  In short: JRuby does not seem terribly well-suited for computationally-intensive tasks, even when most of that computation is taking place in Java-land.  Anyway…

This script is fairly straightforward.  The one interesting landmark is the fact that it actually makes two separate passes over the wordlist file.  The first pass just counts the number of lines, while the second pass reads the words individually, converts them to lower-case and stores them in the bloom filter.  This two-pass approach allows us to calculate the optimal value for k in our bloom filter.  This contrasts with the Ruby implementation of WhatLanguage, which just uses 3 as a global default (the width is the same in both libraries).  According to my math, this per-language optimization actually ekes out an average 2-3% better accuracy in the bloom filters.  This means fewer false positives and nominally better language detection.

I say “nominally” because such a minuscule improvement in bloom filter performance actually makes very little difference in the grand scheme of things.  Most strings are decidedly one language or another, meaning that results should be essentially identical between the two implementations.  Regardless, it’s the thought that counts.  (right?)

Just for the curious, the script produces the following output when run:

  dutch
  ==========================
    Word count: 222908
    Optimal K: 6
    Actual K: 6
    Accuracy: 0.986554232499401

  english
  ==========================
    Word count: 234936
    Optimal K: 5
    Actual K: 5
    Accuracy: 0.9827068384240777

  farsi
  ==========================
    Word count: 339747
    Optimal K: 4
    Actual K: 4
    Accuracy: 0.9408664843803285

  french
  ==========================
    Word count: 629569
    Optimal K: 2
    Actual K: 2
    Accuracy: 0.7817441534106517

  german
  ==========================
    Word count: 298729
    Optimal K: 4
    Actual K: 4
    Accuracy: 0.9590696917792074

  pinyin
  ==========================
    Word count: 399
    Optimal K: 3470
    Actual K: 100
    Accuracy: 1.0

  portuguese
  ==========================
    Word count: 386393
    Optimal K: 3
    Actual K: 3
    Accuracy: 0.9148904670664727

  russian
  ==========================
    Word count: 951830
    Optimal K: 1
    Actual K: 1
    Accuracy: 0.6213162918878307

  spanish
  ==========================
    Word count: 595946
    Optimal K: 2
    Actual K: 2
    Accuracy: 0.7984358472107859

  swedish
  ==========================
    Word count: 54818
    Optimal K: 25
    Actual K: 25
    Accuracy: 0.9999999755910407

I’m not sure why Pinyin only has a vocabulary of 399 words, but that seems to be the way things are.  This means of course that we’re storing barely 400 elements in a bloom filter with a width of 2,000,000.  Needless to say, I wasn’t surprised to see that the optimal k was in the mid thousands.  Unfortunately, this is where our inefficient hash implementation comes back to bite us.  I discovered that if I were to allow a k value of 3470, even when only inserting 399 elements, the time required to process just the single language was upwards of 20 minutes.  This may have something to do with the fact that 3470² is very, very large.

To get around this problem, I cheat and cap the k value at 100 in the script.  This still produces a computed accuracy of 100% in the bloom filter, and it takes much less time to process.  A fast hash is actually even more important during lookups.  Considering that we have to check each token in a string against each and every supported language, a fast lookup is crucial to ensuring performance of the final product.  If it took even 10 ms to compute the hash of a single token when checking against the Pinyin language, the implementation would be completely unusable.

A Little API Design

Now that we have our word lists encoded as bloom filters, we can turn our attention to slightly more important problems; specifically: what do we want the API to look like?  I decided to base my Scala implementation primarily on Peter’s Ruby API.  However, there is one small wrinkle in this plan: WhatLanguage uses symbols for everything.  That’s fine and in line with Ruby idioms, and technically we could use Scala symbols if we really wanted to, but Scala is a statically typed language with a lot of powerful idioms of its own.  It would be a lot more conventional if we found a type-safe way of representing the same concepts.  To that end, I decided to go with a similar (but not identical) approach as exemplified in the following code snippet:

import com.codecommit.lang._
 
"Hello, this is a test".language       // => english
 
val wl = new WhatLanguage(english, french, spanish)
val text = "Bonjour, my name is Daniel. Estoy bien. Como estas? Êtes-vous ennuyer?"
 
wl.processText(text)   // => Map(English -> 4, French -> 5, Spanish -> 6)

The neat trick of the day is our use of static values in scope as a form of type-safe symbol.  We create a new instance of WhatLanguage, passing it three instances of class Language representing (oddly enough) the languages we wish it to use during the analysis.  We can also use “all” as a shortcut for enumerating every supported language.

Another cute little API feature is our use of the lang “package” to bring everything into scope, including two implicit conversions and a small boat-load of language values.  This is made possible by the fact that lang is not actually a package but a singleton object.  Even Scala does not allow values and functions as top-level elements in a package, but it does allow them within objects.  Our import at the head of the snippet is actually equivalent to the following bit of Java:

import static com.codecommit.lang.*;

With all of this in mind, the rest of the library is fairly easy to create.  All we need to do is design an algorithm which splits an input string into tokens, loops over each one keeping track of how many tokens are matched by each language, and finally selects the highest-scoring language from the result and returns its corresponding static value.  As one would expect, this is easily accomplished:

package com.codecommit
 
import java.io.BufferedInputStream
import com.codecommit.collection.BloomSet
 
import scala.collection.mutable.ArrayBuffer
 
object lang {
  case object all extends Language("all") {
    val langs = List(dutch, english, farsi, french, german, pinyin, 
                     portuguese, russian, spanish, swedish)
 
    override lazy val words = BloomSet[String]()
 
    override val toString = "All languages meta-variable"
  }
 
  val dutch = Language("dutch")
  val english = Language("english")
  val farsi = Language("farsi")
  val french = Language("french")
  val german = Language("german")
  val pinyin = Language("pinyin")
  val portuguese = Language("portuguese")
  val russian = Language("russian")
  val spanish = Language("spanish")
  val swedish = Language("swedish")
 
  implicit def conversion(str: String) = new {
    val language = new WhatLanguage(all).language(str).getOrElse(null)
  }
 
  implicit def languageToString(lang: Language): String = lang.toString
 
  sealed case class Language private[lang] (file: String) {
    lazy val words = {
      val is = new BufferedInputStream(getClass.getResourceAsStream("/lang/" + file))
 
      try {
        BloomSet.load[String](is)
      } finally is.close()
    }
 
    override val toString = file(0).toUpperCase + file.substring(1)
 
    override def equals(other: Any) = other match {
      case lang: Language => toString == lang.toString
      case _ => false
    }
 
    override val hashCode = file.hashCode
  }
 
  class WhatLanguage(langs: Language*) {
 
    def language(str: String) = {
      val back = new ArrayBuffer[Language]
      var max = 0
 
      for ((lang, score) <- processText(str)) {
        if (score > max) {
          back.clear
 
          back += lang
          max = score
        } else if (score == max) {
          back += lang
        }
      }
 
      if (back.length == 1) Some(back(0)) else None
    }
 
    def processText(str: String) = {
      val langs = if (this.langs.contains(all)) all.langs else this.langs
      val prime = langs.foldLeft(Map[Language, Int]()) { _(_) = 0 }
 
      str.split("""\s+""").foldLeft(prime) { (map, token) =>
        langs.foldLeft(map) { (map, lang) =>
          if (lang.words.contains(token.toLowerCase)) {
            map(lang) = map(lang) + 1 
          } else map
        }
      }
    }
  }
}

Conclusion

Well, it’s been a fun and informative journey through the fertile land of bloom sets and simple statistical language analysis.  As usual, all of the source (and binaries) are available for download.  It’s a bit of a shame, but I didn’t get a chance to discuss a number of rather interesting features also available in my bloom filter implementation (such as concatenation of bloom filters).

Bloom filters are quite interesting in and of themselves, and definitely a useful technique that every developer should keep in their back pocket.  Peter Cooper definitely merits kudos for his implementation of WhatLanguage based on bloom filters, but despite the fact that I took the time to port it to Scala, I still don’t see much practical benefit.  At any rate, the library is now available in Scala; hopefully someone else will find it more useful than I do!