Skip to content

Bloom Filters in Scala (and all the fun that they bring)

13
Oct
2008

This is a post I have been meaning to write for quite some time.  Despite being an officially excommunicated former-member of the Ruby community, I still like to keep up on the latest developments.  A few months ago, my Ruby Inside feed produced an interesting post introducing a new library by Peter Cooper (the author of Ruby Inside).  Unlike the most “new” Ruby libraries which are most often just bad clones of Rails snibblets, Peter’s WhatLanguage seemed to be quite unique and interesting.  For the first time in years, I was reading about an innovation in Ruby-land which actually sounded quite exciting.  A good feeling to be sure.

In a nutshell, WhatLanguage (project page at GitHub) is a statistical language identification API which analyses arbitrary input String(s), scoring them against a set of built-in languages (including English, French, Spanish, Russian, and others).  Add to that a little bit of meta-programming magic and the possibilities are eye-catching to say the least:

require 'whatlanguage'
 
'Here is some text to identify'.language       # => :english
'¿Hola, pero donde está su pelo?'.language     # => :spanish

To be honest, I’m not sure what practical application I would have for such a library, but it sure looks like fun!  :-)  Under the surface, the implementation is deceptively simple.  WhatLanguage bundles a set of words for each language it supports.  The implementation then searches these sets for each word in the string, aggregating a score for each language depending on whether or not the word was found.  If exactly one language has a higher score than the others once the end of the string is reached, it is “declared the winner” and returned as a result.  Otherwise, the analysis is undecided and nil is returned.  Simple, right?

The tricky part is somehow storing and searching those gigantic language word sets in an efficient manner.  As it turns out, this isn’t so easy.  Just stop and consider the number of words which exist in your native tongue.  Unless you speak a very strange, primarily-tonal language, the number is likely in the upper hundreds of thousands.  Just going off of Peter’s word lists, Russian is the real trouble maker, weighing in with a nearly one million word vocabulary.  Any attempt to store that much data would require megabytes of memory and several minutes to process, even on a fast machine.  Any sort of hash set is completely out of the question due to the nature of contiguous memory allocation, but can you imagine trying to perform a search traversal on a tree with almost a million nodes?  Obviously, a more effective solution is required.

Probabilistic Sets

So far we have been focusing our imaginary efforts on data structures which guarantee 100% accuracy and full reversibility.  That is to say, data structures which allow you to get out what you put in and which ensure that if something is “in” it will be found and if something is “out” it will not.  After all, is a data structure without these guarantees really all that useful?

As it turns out, the answer is “yes”.  Consider a standard hash set as an example.  Sets make absolutely no guarantee about the ordering of their elements.  If you attempt to iterate over a set, you will eventually reach everything inside, but you cannot rely on a certain element coming before the others.  Of course, the advantage we gain by making this trade-off is constant time searching by item identity.  Lists cannot achieve this performance, neither can arrays.  If you want to search either one of these structures, you must look through every element from start to finish.  So with hash sets, we have effectively traded a fundamental guarantee about our data for efficiency in some other respect.

The same is true for probabilistic sets.  We don’t actually need all of the properties of a set to implement WhatLanguage.  For example, we don’t need to actually retrieve a word from our dictionary, we only need to discover whether or not that word is in the set.  In fact, we really only need our data structure to have two important properties:

  • Lightning fast boolean searches (does the set contain this element?)
  • Minimal memory footprint

Hash sets can make the first guarantee, but as we have already noticed, they don’t really require any less memory than just storing everything in an array.  Since we can’t afford one million words-worth of memory per language, we need to come up with some way of storing information about the presence of an element without retaining the data it represents.  In order to do this, we have to make one important sacrifice: accuracy.

We basically need to accept the fact that our data structure may sometimes return inaccurate results.  As it turns out though, this is just fine.  Remember, we’re scoring languages one token at a time; an incorrect answer now and again isn’t going to kill us.  Given a large enough input string, a minimal percentage of inaccuracy will literally be lost in the noise.

As it turns out, there already exists a very old technique for accomplishing exactly these requirements.  Bloom filters have been around since the beginning of time (or at least the 1980s), but unfortunately very few developers are aware of them.  In a nutshell, they precisely satisfy all of the needs we just finished enumerating.  They provide true (not amortized) constant-time searches and will consistently return true or false for a given element, depending on whether or not the value is in the set.  Occasionally they will give false positives (claiming that an element is in the set when in fact it is not), but never false negatives.  This property is extremely useful in many applications, but it turns out that this is even more than we need for WhatLanguage.

Most importantly though, bloom filters are extremely compact in memory.  Rather than storing an entire value, they hash every element to one or (usually) more indexes in an array of bits (0 or 1).  In an optimal bloom filter guaranteeing no false positives, each element would correspond to precisely k bits of memory (where k is the number of hash functions, usually in the vicinity of 3 or 4).  Once this hashing is complete, the value itself may be discarded.  Thus, assuming an optimal bloom filter with 3 hash functions, the storage requirements for one million words would only be 375 KB of memory.  Compare this to the 8 MB required just to store the pointers to each of the words, much less the values themselves!  If we include the underlying character values, we start pushing closer to 20 MB.

375 KB for a million elements is impressive, but bloom filters can actually do a lot better than that.  Remember, we agreed to accept some false positives.  It’s easy to imagine how even an ideal bloom filter (such as the one we have been analyzing) could have some inaccuracy, but if we’re willing to live with a few more false positives, we can allow hash collisions and drop our memory requirements into the 200 KB range.  Not bad, considering how much data we’re storing!

Bloom Filters

So how does the magic actually work?  We’ve hinted at the implementation in our analysis of memory requirements, but we’re still missing some important details.  Consider the following diagram (courtesy of Wikipedia):

Under the surface, a bloom filter is just an array of boolean values.  The length of the array is referred to as the width of the bloom filter.  When the filter is empty, all of the array values are set to false.  To insert an element, we compute k different hash indexes in the array and set them all to true.  The real beauty of the data structure is the manner in which we handle collisions: we don’t worry about it.  Whenever one (or more) of the hash functions returns an index which is already true, we simply leave that index as-is and move on.  Thus, the procedure for inserting an element into a “full” bloom filter is precisely the same as inserting into an empty one: set all of the hash indexes to true.  In a sense, bloom filters can never be full, they just get less and less accurate.

Bloom filters define only one operation other than insertion: searching.  To determine whether or not a value is in the set, we run all of our hash functions and find the indexes to which it corresponds.  If all of these indexes are true, then we can assume that the value is indeed in the set.  Conversely, if any one of indexes is false, then the value cannot be in the set since we would have set that value to true upon insertion.

It is important to note that removal is not supported by bloom filters.  Intuitively we could just take the value to be removed, calculate its hash indexes and then set the corresponding elements to false.  However, if any other values also hash to even one of those locations, we will be effectively removing it from the bloom filter “by accident”.  Subsequent searches for that value will hash to the index which we erroneously set to false, resulting in a false negative on the search (saying that an element is not in the set when in fact it is).  Fortunately for us, WhatLanguage only requires insertion and searching for values, removal is not on the agenda.

Scala Implementation

Well, this has probably been a nostalgic (for those of you who enjoyed your university days) stroll through the land of data structure theory, but we’re still lacking one critical factor: a concrete implementation.  Bloom filters don’t sound that complicated on paper, but to do anything useful with them we’re going to need some code.

Before we dive in, we might want to define some more precise requirements for our implementation.  We’re writing this in Scala, so it would be nice to implement bloom filters as a functional data structure.  I’m not going to get into all the benefits of immutable data structures; suffice it to say, I really think that they’re the only way to go for most serious applications.  Unfortunately, in order to do this, we’re going to need some sort of immutable, array-like data structure under the surface which provides random access in constant time.  Sound familiar?

This is actually where the whole saga of persistent vectors in Scala found its roots.  The first time I tried to implement a functional bloom filter in Scala I made use of a conventional array under the surface with copy-on-write semantics.  As you can imagine, this is hideously inefficient in both space and time.  In order to solve the problem, I had to implement a data structure which was just as efficient for reading and writing, but which also shared structure with subsequent versions.  Now, two months later, I’ve borrowed just such a design from Clojure and created a first-class implementation in Scala.

With a persistent vector now at our disposal, we can easily implement a persistent bloom filter by using a vector to represent our boolean indexes.  Every time we perform an insertion, we will “modify” the relevant indexes, creating a new instance of the vector.  This vector can then be used in our “modified” bloom filter, returned as a new instance of our container class.  Since persistent vectors share unmodified structure between versions, so does our bloom filter.  All of the hard work of efficient immutability is already done for us, we just need to implement a thin facade.  We start out by defining the class skeleton:

import BloomSet._
 
class BloomSet[A] private (val size: Int, val k: Int, 
                           private val contents: Vector[Boolean]) extends ((A)=>Boolean) {
  val width = contents.length
 
  def this(width: Int, k: Int) = this(0, k, alloc(width))
}
 
object BloomSet {
  def alloc(size: Int) = {
    (0 until size).foldLeft(Vector[Boolean]()) { (c, i) => c + false }
  }
}

We’ll keep all of our utility functions in the companion object.  There’s really no need to put them in the main class, and the JVM is able to optimize dispatch to static members to a slightly higher degree than it can instance members (invokestatic).  Technically, we don’t really need the alloc function, it just hides some of the boiler-plate involved in initializing a vector to a certain length.  Note that this operation is extremely fast, even for high values of width.  Despite the fact that we are creating a huge number of Vector objects, very little memory is actually wasted.  Even at this early stage in the implementation, the advantages of persistence are evident.

A Little Math

You’ll notice that our bloom filter has three major properties: size, k, and width.  As previously mentioned, k is the number of hash functions to be used, while width is the length of the array used to store the boolean values.  These values have to be user-configurable because different applications will call for different optimizations.  For example, implementing WhatLanguage requires the storage of almost a million elements per set.  Obviously we would not want to attempt to use a bloom filter which only had a width of (say) 200.  This wouldn’t cause any errors per se (remember, bloom filters are never “full”), but the accuracy of such a filter would be almost nil.  With a million elements all hashed into a mere 200 indexes, it is likely that that every single index would have a value of true.  Such a bloom filter would return true when searching for any value, including those which were not inserted.  We said we would tolerate some false positives, but we still want the results to be meaningful.  Since the width of a bloom filter cannot be changed once data is inserted (at least, not without “losing” data), the user must be able to decide in advance how much space will be required and then instantiate BloomSet accordingly.

The value of k is an even more interesting proposition.  The intuition is that the larger our value of k (thus, the more hash functions we use), the greater the accuracy of our bloom filter and the fewer false positives we will return.  This is technically true when the bloom filter does not contain very many elements, but as more and more values are inserted, higher values of k become problematic.  Wikipedia has an excellent (and extremely readable) mathematical analysis of this and other properties of bloom filters.  As it turns out, the optimal value for k (one which will yield the fewest false positives) can be computed by the following expression (where m is the width and n is the size, or the number of elements contained in the filter):

k = \mbox{ln}(2) \; \frac{m}{n}

For the sake of quick-computation, this is essentially equivalent to:

k \approx \frac{9m}{13n}

The equivalence isn’t precise (logarithms are almost always irrational), but since we’re already rounding the result to the nearest integer, we don’t need to be too concerned.

While it would be nice if we could just auto-magically set k to the optimum value and be done with it, that just isn’t the case.  Like width, k cannot be changed after data has been inserted.  Unfortunately, its optimum value is dependent on the number of items which will eventually be in the set.  Since we don’t really “know” this value within the data structure until after the insertions are complete, there is no way for us to self-optimize.  To provide maximum configurability, we need to allow this value to be set by the API consumer on a case-by-case basis.

As an interesting aside, we can actually compute the accuracy of a bloom filter on a scale from 0 to 1 given values for width, k and size.  In this case, accuracy is defined as the odds that a search result is indeed accurate (the chance of it not being a false positive).  Wikipedia gives the following expression for this accuracy:

1 - (1 - e^{-kn / m})^k

Thanks to the marvels of approximation, this is a fairly efficient computation on modern platforms.  Translated into Scala, it looks something like this:

lazy val accuracy = {
  val exp = ((k:Double) * size) / width
  val probability = Math.pow(1 - Math.exp(-exp), k)
 
  1d - probability
}

Core Functionality

As interesting as this minor detour has been, we still haven’t arrived at a working implementation of the bloom filter data structure.  We have the basic structure in place, but we still need to implement the two core operations: insert and search, rendered in Scala as + and contains:

def +(e: A) = new BloomSet[A](size + 1, k, add(contents)(e))
 
def contains(e: A) = {
  (0 until k).foldLeft(true) { (acc, i) => 
    acc && contents(hash(e, i, width)) 
  }
}
 
protected def add(contents: Vector[Boolean])(e: Any) = {
  var back = contents
 
  for (i <- 0 until k) {
    back = back(hash(e, i, width)) = true
  }
 
  back
}

Feels sort of anti-climactic doesn’t it?  After all that analysis, we find that it is possible to implement the essence of a bloom filter in less than 17 lines of code.  Technically, I probably could trim this down to more like 8 lines, but this isn’t an exercise in Scala golf.  These three methods are just the materialization of all of the hand-wavy descriptions that we’ve spent the last five pages working through.  Each time we insert an element, we calculate k different hash values and set the appropriate vector indexes to true.  To find that element, we calculate the same k hashes and ensure that all of the indexes have an appropriate boolean value.

Hashing

Those of you who are actually reading the code snippets rather than just blindly trusting my veracity will probably notice that I’ve omitted a fairly important function from the above sample; specifically: hash.  Somehow, we need to define a function which will simulate the existence of a potentially unlimited number of hash functions.  Based on our usage of this function, we need to be able to grab a hash value for a given value and function index, restricted to a given width.  This is not a trivial problem.

As a good developer, you should already be trying to think up ways to reduce the code involved in implementing these requirements.  I mean, we could potentially just dream up a few dozen different hash functions with a fairly wide distribution of results.  However, I don’t know about you, but I really don’t have that much imagination.  I have a hard enough time developing just one solid hash function, much less dozens.  What’s more, if each hash function corresponds with a different method, then we can’t really scale the number of hash functions to any integer k.

A better approach would be to create a single hash function and then skew its value n times for every n between 0 and k.  Like most things that have to do with hashing, it’s cheating, but providing we do our job right, it should give us a fairly decent way of computing the required k hash functions.  Of course, the key phrase there is “do our job right”.  We need to make sure that we skew the initial hash value sufficiently for each successive hash, otherwise successive hash functions might get “stuck in a groove”.  For example, let’s assume that each successive hash function is equivalent to the previous hash + 2.  If we then inserted two elements into the bloom filter with k = 3, there would be the potential that the hash(0) for the first element would be precisely 2 greater than the hash(0) for the second.  If this is the case, then hash(1) for the second would equal hash(0) for the first.  Thus, the first and second elements would collide on two out of three indexes.  Not exactly an optimal distribution.

To avoid this problem, we should be sure to skew the hash value by a different amount for each hash function.  A convenient way of doing this is to use the number of iterations, or the index of the particular hash function.  To further improve the distribution, we will also use the bitwise XOR operation rather than simple addition.  It’s not really intuitively obvious, but XOR is mathematically far superior for hashing and just as cheap in terms of processor cycles.  Put it all together, and the implementation looks something like this:

def hash(e: Any, iters: Int, bounds: Int): Int = {
  Math.abs(
    if (iters == 0) e.hashCode 
    else iters ^ hash(e, iters - 1, bounds)
    ) % bounds
}

The final bit of math in this function mods our skewed hash value with the given bounds.  Whenever we call hash, we always pass the width of the bloom filter.  This is a trick that you’ll see a lot in any hash-based data structure.  Essentially, we’re just mapping our hash value into the required domain.  Hash values can be anything between -231 and 231 - 1, but our vector indexes can only be between 0 and width - 1.  Just a rule of thumb, but any time you have a problem where you must “map values from a large domain to a more restricted one”, mod is generally the correct approach.

Observant readers will notice that our implementation of hash is a little inefficient.  Actually, it’s worse than just a “little” inefficient: it’s downright terrible.  Remember that we have to call this method once for every integer n between 0 and k.  This isn’t so bad, since k is usually comparatively small, but the real problem is our sneaky use of recursion within hash.  The time complexity of hash is precisely O(n), where n is the number of iterations.  Once you factor this in with the number of times we call this function and it is easy to see how the total complexity of computing k hash values is actually O(k2).  Thus, for k = 4, we actually call hash 16 times!  You can imagine how this could get very problematic for higher values of k.

Technically, we could rewrite add and contains to compute the hash values incrementally from the bottom up.  This would bring the hashing complexity down to O(k).  However, this is a bit more elegant in terms of code, not to mention easier to talk about in a modular fashion.  In practice, this inefficiency is only a problem for higher values of k due to the fact that computing a single-step hash value is actually a very fast operation.  We will run into trouble in a little bit though when we attempt to optimize the bloom sets for shorter wordlists in WhatLanguage.

Serialization

One of the important requirements of WhatLanguage that we haven’t really touched on so far is the ability to store freeze-dried bloom filters on disk and then suck them back into memory quickly and efficiently.  If you think about it, this requirement only makes sense since one of our primary motivations from the start was to avoid the burdens associated with huge word lists.  After all, if our in-memory representation of a language is only a few hundred kilobytes, shouldn’t the on-disk representation be no worse?  Furthermore, if we had to parse a wordlist into a bloom filter every time we wanted to check a string’s language, the performance imposition would be utterly prohibitive.

Fortunately, there is a fairly obvious format which we could use to efficiently serialize an arbitrary bloom filter.  Recall that the real heart of a bloom filter is the array of boolean flags, an array which can also be represented as an array of bits.  We could very easily select these bits eight at a time, combine them together using bit-shifting and produce a byte value which could be sent to an OutputStream.  Thus, a bloom filter with an array of {true, false, true, true, true, false, false, true, false} would be stored as the following series of bits: 101110010, which in hexadecimal byte form becomes B9 00.

In addition to the array, we also need to store the size, k, and width of the bloom filter, otherwise it will be impossible to find anything reliably in the deserialized result.  The Scala store and load methods to accomplish this are verbose, but fairly straightforward:

class BloomSet[A] ... {
  ...
 
  def store(os: OutputStream) {
    os.write(convertToBytes(size))
    os.write(convertToBytes(k))
    os.write(convertToBytes(contents.length))
 
    var num = 0
    var card = 0
    for (b <- contents) {
      num = (num << 1) | (if (b) 1 else 0)    // construct mask
      card += 1
 
      if (card == 8) {
        os.write(num)
 
        num = 0
        card = 0
      }
    }
 
    if (card != 0) {
      os.write(num)
    }
  }
}
 
object BloomSet {
  def load[A](is: InputStream) = {
    val buf = new Array[Byte](4)
 
    is.read(buf)
    val size = convertToInt(buf)
 
    is.read(buf)
    val k = convertToInt(buf)
 
    is.read(buf)
    val width = convertToInt(buf)
 
    var contents = Vector[Boolean]()
    for (_ <- 0 until (width / 8)) {
      var num = is.read()
      var buf: List[Boolean] = Nil
 
      for (_ <- 0 until 8) {
        buf = ((num & 1) == 1) :: buf
        num >>= 1
      }
 
      contents = contents ++ buf
    }
 
    if (width % 8 != 0) {
      var buf: List[Boolean] = Nil
      var num = is.read()
 
      for (_ <- 0 until (width % 8)) {
        buf = ((num & 1) == 1) :: buf
        num >>= 1
      }
 
      contents = contents ++ buf
    }
 
    new BloomSet[A](size, k, contents)
  }
 
  ...
}

It is interesting to note that store is a good example of what David MacIver meant when he says that functional code image good code.  I actually tried to write this in a functional style to start with, but I gave up after it became horribly ugly (having four nested folds is never a good sign).  The imperative rendering (in this case) is concise and elegant, one of the many advantages of a hybrid languages like Scala.

This particular format for rendering a bloom set is almost precisely identical to the format used by Peter Cooper’s Bloomin’ Simple, the library which sits underneath WhatLanguage.  However, due to hashing differences it is impossible to use his serialized bloom filters in Scala or vice versa.

WhatLanguage in Scala

So now that we have our magic bloom filter ready and waiting, it’s time to take a crack at Peter’s fancy language identifier!  As it turns out, this is a lot simpler than the bloom filter itself.  Conceptually, all we have to do is load a different bloom filter for each language and then use that bloom filter to check each input string token individually.  For the sake of uniformity, we will convert all tokens to lower-case.

Of course, before we can load the language bloom filters we need to first store them, and we cannot store a bloom filter before it is generated.  To that end, we need to create a simple Scala script (yes, a script) which reads in a file of newline-delimited words, inserts them all into an instance of BloomSet and then stores the result in a corresponding language file.  Scala’s scripting functionality is actually quite good, especially for a statically typed language, and really deserves some better recognition.  The complete language generator script is as follows:

import com.codecommit.collection.BloomSet
import java.io.{BufferedOutputStream, File, FileOutputStream}
import scala.io.Source
 
val WIDTH = 2000000
 
def computeK(lines: Int) = (((9:Double) * WIDTH) / ((13:Double) * lines)).intValue
 
for (file <- new File("wordlists").listFiles) {
  if (!file.isDirectory) {
    println(file.getName)
    println("==========================")
 
    val src = Source.fromFile(file)
    val count = src.getLines.foldLeft(0) { (i, line) => i + 1 }
    println("  Word count: " + count)
 
    val optimalK = computeK(count)
    val init = new BloomSet[String](WIDTH, Math.min(optimalK, 100))
 
    println("  Optimal K: " + optimalK)
    println("  Actual K: " + init.k)
 
    val set = src.reset.getLines.foldLeft(init) { _ + _.trim }
 
    println("  Accuracy: " + set.accuracy)
 
    val os = new BufferedOutputStream(
        new FileOutputStream("src/main/resources/lang/" + file.getName))
    set.store(os)
    os.close()
 
    println()
  }
}

I originally created this script using JRuby, thinking of course that it would be much easier to perform a simple, one-off task like this in a dynamically typed language.  Interestingly enough, the JRuby version was about twice as long and actually took dramatically longer to run.  By “dramatically” I mean on the order of seventy times longer.  The above Scala script takes just over 30 seconds to run on my machine using the wordlists from WhatLanguage.  This stands in stark contrast to the 35 minutes required for the JRuby script.  Both scripts use the same underlying data structure to perform all the work (BloomSet), so it’s really hard to claim that one implementation was fundamentally slower than the other.  In short: JRuby does not seem terribly well-suited for computationally-intensive tasks, even when most of that computation is taking place in Java-land.  Anyway…

This script is fairly straightforward.  The one interesting landmark is the fact that it actually makes two separate passes over the wordlist file.  The first pass just counts the number of lines, while the second pass reads the words individually, converts them to lower-case and stores them in the bloom filter.  This two-pass approach allows us to calculate the optimal value for k in our bloom filter.  This contrasts with the Ruby implementation of WhatLanguage, which just uses 3 as a global default (the width is the same in both libraries).  According to my math, this per-language optimization actually ekes out an average 2-3% better accuracy in the bloom filters.  This means fewer false positives and nominally better language detection.

I say “nominally” because such a miniscule improvement in bloom filter performance actually makes very little difference in the grand scheme of things.  Most strings are decidedly one language or another, meaning that results should be essentially identical between the two implementations.  Regardless, it’s the thought that counts.  (right?)

Just for the curious, the script produces the following output when run:

  dutch
  ==========================
    Word count: 222908
    Optimal K: 6
    Actual K: 6
    Accuracy: 0.986554232499401

  english
  ==========================
    Word count: 234936
    Optimal K: 5
    Actual K: 5
    Accuracy: 0.9827068384240777

  farsi
  ==========================
    Word count: 339747
    Optimal K: 4
    Actual K: 4
    Accuracy: 0.9408664843803285

  french
  ==========================
    Word count: 629569
    Optimal K: 2
    Actual K: 2
    Accuracy: 0.7817441534106517

  german
  ==========================
    Word count: 298729
    Optimal K: 4
    Actual K: 4
    Accuracy: 0.9590696917792074

  pinyin
  ==========================
    Word count: 399
    Optimal K: 3470
    Actual K: 100
    Accuracy: 1.0

  portuguese
  ==========================
    Word count: 386393
    Optimal K: 3
    Actual K: 3
    Accuracy: 0.9148904670664727

  russian
  ==========================
    Word count: 951830
    Optimal K: 1
    Actual K: 1
    Accuracy: 0.6213162918878307

  spanish
  ==========================
    Word count: 595946
    Optimal K: 2
    Actual K: 2
    Accuracy: 0.7984358472107859

  swedish
  ==========================
    Word count: 54818
    Optimal K: 25
    Actual K: 25
    Accuracy: 0.9999999755910407

I’m not sure why Pinyin only has a vocabulary of 399 words, but that seems to be the way things are.  This means of course that we’re storing barely 400 elements in a bloom filter with a width of 2,000,000.  Needless to say, I wasn’t surprised to see that the optimal k was in the mid thousands.  Unfortunately, this is where our inefficient hash implementation comes back to bite us.  I discovered that if I were to allow a k value of 3470, even when only inserting 399 elements, the time required to process just the single language was upwards of 20 minutes.  This may have something to do with the fact that 34702 is very, very large.

To get around this problem, I cheat and cap the k value at 100 in the script.  This still produces a computed accuracy of 100% in the bloom filter, and it takes much less time to process.  A fast hash is actually even more important during lookups.  Considering that we have to check each token in a string against each and every supported language, a fast lookup is crucial to ensuring performance of the final product.  If it took even 10 ms to compute the hash of a single token when checking against the Pinyin language, the implementation would be completely unusable.

A Little API Design

Now that we have our word lists encoded as bloom filters, we can turn our attention to slightly more important problems; specifically: what do we want the API to look like?  I decided to base my Scala implementation primarily on Peter’s Ruby API.  However, there is one small wrinkle in this plan: WhatLanguage uses symbols for everything.  That’s fine and in line with Ruby idioms, and technically we could use Scala symbols if we really wanted to, but Scala is a statically typed language with a lot of powerful idioms of its own.  It would be a lot more conventional if we found a type-safe way of representing the same concepts.  To that end, I decided to go with a similar (but not identical) approach as exemplified in the following code snippet:

import com.codecommit.lang._
 
"Hello, this is a test".language       // => english
 
val wl = new WhatLanguage(english, french, spanish)
val text = "Bonjour, my name is Daniel. Estoy bien. Como estas? Êtes-vous ennuyer?"
 
wl.processText(text)   // => Map(English -> 4, French -> 5, Spanish -> 6)

The neat trick of the day is our use of static values in scope as a form of type-safe symbol.  We create a new instance of WhatLanguage, passing it three instances of class Language representing (oddly enough) the languages we wish it to use during the analysis.  We can also use “all” as a shortcut for enumerating every supported language.

Another cute little API feature is our use of the lang “package” to bring everything into scope, including two implicit conversions and a small boat-load of language values.  This is made possible by the fact that lang is not actually a package but a singleton object.  Even Scala does not allow values and functions as top-level elements in a package, but it does allow them within objects.  Our import at the head of the snippet is actually equivalent to the following bit of Java:

import static com.codecommit.lang.*;

With all of this in mind, the rest of the library is fairly easy to create.  All we need to do is design an algorithm which splits an input string into tokens, loops over each one keeping track of how many tokens are matched by each language, and finally selects the highest-scoring language from the result and returns its corresponding static value.  As one would expect, this is easily accomplished:

package com.codecommit
 
import java.io.BufferedInputStream
import com.codecommit.collection.BloomSet
 
import scala.collection.mutable.ArrayBuffer
 
object lang {
  case object all extends Language("all") {
    val langs = List(dutch, english, farsi, french, german, pinyin, 
                     portuguese, spanish, swedish)
 
    override lazy val words = BloomSet[String]()
 
    override val toString = "All languages meta-variable"
  }
 
  val dutch = Language("dutch")
  val english = Language("english")
  val farsi = Language("farsi")
  val french = Language("french")
  val german = Language("german")
  val pinyin = Language("pinyin")
  val portuguese = Language("portuguese")
  val russian = Language("russian")
  val spanish = Language("spanish")
  val swedish = Language("swedish")
 
  implicit def conversion(str: String) = new {
    val language = new WhatLanguage(all).language(str).getOrElse(null)
  }
 
  implicit def languageToString(lang: Language): String = lang.toString
 
  sealed case class Language private[lang] (file: String) {
    lazy val words = {
      val is = new BufferedInputStream(getClass.getResourceAsStream("/lang/" + file))
 
      try {
        BloomSet.load[String](is)
      } finally is.close()
    }
 
    override val toString = file(0).toUpperCase + file.substring(1)
 
    override def equals(other: Any) = other match {
      case lang: Language => toString == lang.toString
      case _ => false
    }
 
    override val hashCode = file.hashCode
  }
 
  class WhatLanguage(langs: Language*) {
 
    def language(str: String) = {
      val back = new ArrayBuffer[Language]
      var max = 0
 
      for ((lang, score) <- processText(str)) {
        if (score > max) {
          back.clear
 
          back += lang
          max = score
        } else if (score == max) {
          back += lang
        }
      }
 
      if (back.length == 1) Some(back(0)) else None
    }
 
    def processText(str: String) = {
      val langs = if (this.langs.contains(all)) all.langs else this.langs
      val prime = langs.foldLeft(Map[Language, Int]()) { _(_) = 0 }
 
      str.split("""\s+""").foldLeft(prime) { (map, token) =>
        langs.foldLeft(map) { (map, lang) =>
          if (lang.words.contains(token.toLowerCase)) {
            map(lang) = map(lang) + 1 
          } else map
        }
      }
    }
  }
}

Conclusion

Well, it’s been a fun and informative journey through the fertile land of bloom sets and simple statistical language analysis.  As usual, all of the source (and binaries) are available for download.  It’s a bit of a shame, but I didn’t get a chance to discuss a number of rather interesting features also available in my bloom filter implementation (such as concatenation of bloom filters).

Bloom filters are quite interesting in and of themselves, and definitely a useful technique that every developer should keep in their back pocket.  Peter Cooper definitely merits kudos for his implementation of WhatLanguage based on bloom filters, but despite the fact that I took the time to port it to Scala, I still don’t see much practical benefit.  At any rate, the library is now available in Scala; hopefully someone else will find it more useful than I do!

Software Transactional Memory in Scala

6
Oct
2008

The fact is that there are a lot of problems that are hard to solve in a purely-functional style.  That’s not to say that no solution exists, but certain problems are very difficult to model without shared state.  In such situations, a slightly different approach to concurrency must be considered.  Actors are inapplicable, seeing as they are the embodiment of “shared-nothing” continuation passing, and fork/join doesn’t really help us.  Usually, when faced with the need for shared mutable state, most developers will resort to the old-fashioned technique of locking everything and controlling access to that state one thread at a time.  The solution often goes something like this:

public class Container {
    private int value;
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
 
    public int getValue() {
        lock.readLock().lock();
        try {
            return value;
        } finally {
            lock.readLock.unlock();
        }
    }
 
    public void setValue(int value) {
        lock.writeLock().lock();
        try {
            this.value = value;
        } finally {
            lock.writeLock.unlock();
        }
    }
}

Obviously, most scenarios which call for such locking are a bit more complex, but you get the picture.  There are two very serious problems with locking:

  • Implementation rules are ad hoc
  • Throughput is reduced

The first issue is obvious: there is nothing actually preventing us from accessing value without locking.  What’s worse, it would be just as easy to lock, access value and then forget to unlock when we were done.  There is no logical connection between value and its lock, the implementation rules are solely enforced by convention.  Of course, we can avoid the problem of forgetting to unlock by using a locking strategy which is lexical in nature, but in that case our second issue (throughput) becomes even more pronounced:

private int value;
private final Object lock = new Object();
 
public int getValue() {
    synchronized (lock) {
        return value;
    }
}
 
public void setValue(int value) {
    synchronized (lock) {
        this.value = value;
    }
}

The problem here is that everything blocks everything else.  If one thread wants to retrieve value, it literally stops the world while the other threads wait their turn.  Data integrity is guaranteed, but this was accomplished only by creating a serial bottleneck on read/write access to value.  On top of that, we still have no enforced correlation between the lock and its corresponding piece of data.

“Glass is Half Full” Concurrency

What we have been doing here is called “pessimistic locking”.  We have literally started with the assumption that we are going to run into data contention issues.  Our entire locking mechanism is designed with the worst case scenario in-mind: lots of threads trying to write to the same piece of data simultaneously.  However, not every concurrent system is going to have constant interference between threads.  As it turns out, in practice data contention is the exception rather than the rule.  Given this, it would be very nice if we could design a system which gave threads the “benefit of the doubt”, assuming that they will not conflict, while still somehow maintaining data integrity.

The answer to this is “optimistic locking”.  We turn the problem completely on its head.  Instead of assuming that there will be problems and so locking everything preemptively, we assume that everything will be fine and just allow everyone free, non-blocking access.  Of course, an “open-door policy” with regards to shared mutable state isn’t enough in and of itself, we have to have some rule for dealing with contention issues, and some way of detecting those conflicts when they happen.  Enter transactions…

The idea behind a purely-optimistic transactional memory model is that all write operations must occur within a transaction.  Data can be read any time, but the memory model will ensure that the data is never in an “intermediate state” - or between writes.  Let’s consider the more complicated scenario of the transfer of funds from one bank account to another.  The steps are as follows:

  1. Withdraw $500 from account A
  2. Deposit $500 into account B

Of course, this isn’t really an algorithm, it’s a high-level overview.  We’re really looking at something a bit more complicated than it would seem:

def transfer(amount: Int, a: Account, b: Account) {
  a.balance = a.balance - amount
  b.balance = b.balance + amount
}

I avoided using the += and -= operators so as to illustrate the problem more explicitly.  This operation has four separate operations on shared mutable state (the account balance).  It’s not too difficult to see how this could go terribly wrong in the case where two separate transfers are taking place simultaneously.  For example, we could transfer $500 from account A into account B, while at the same time we transfer $200 from account B into account C.  Remember, we’re not locking anything, so we run the danger that our concurrent execution order will be interleaved in the following fashion:

  1. [Thread1] Get balance of account A
  2. [Thread1] Set balance of account A to its former amount less 500
  3. [Thread1] Get balance of account B
  4. [Thread2] Get balance of account B
  5. [Thread1] Set balance of account B to its former amount plus 500
  6. [Thread2] Set balance of account B to its former amount less 200
  7. [Thread2] Get balance of account C
  8. [Thread2] Set balance of account C to its former amount plus 200

The operation in red is the one we need to be concerned about.  Thread2 retrieved the balance of account B just prior to when it was modified by Thread1.  This means that when Thread2 calculates the new balance (less 200), it will be basing its result on a now-obsolete balance.  When it sets the balance, the $500 that was transferred from account A will mysteriously vanish, the unfortunate victim of a common race condition.

In a transactional system, both transfers would be separately handled in their own transaction.  Neither of them would modify the other’s data until all operations are completed, at which point the transaction would commit and the data would become “live”.  Thus, when the Thread1 adds $500 to account B, the actual balance of account B will remain constant at its original value (prior to the transaction).  Once the transaction commits, both the balance of A and B will be updated essentially simultaneously.  There will never be a point where $500 is discovered missing “in transit” between two accounts.

This is only half the equation though.  The real genius of the transactional model is just prior to committing, a transaction validates itself, ensuring that all of the data it worked with is still in the same state it was at the beginning of the transaction.  If something has changed, then the transaction is invalidated and must be re-run automatically.  Our issue from above can never happen because the Thread2 transaction will attempt to validate itself, only to discover that the balance of account B has been changed in the meantime.  Rather than yield to the race condition, the Thread2 transaction will throw away its work and start from scratch.  Assuming that nothing else is running, the second validation will be successful (since nothing will have changed concurrently) and the transaction will commit.

It all sounds much more complicated than it actually is.  People familiar with modern databases like Oracle should already be comfortable working with optimistic transactional models.  The technique is a little less common in software, but it can still be applied.

In order to make this work in a conventional application setting, we need to introduce a few more abstractions.  There are several ways to go about this, but I have chosen to follow the model laid down by Rich Hickey in his implementation of Clojure’s STM.  In turn, Clojure seems to take a fair bit of inspiration from Haskell’s STM monad, although it does not port over concepts like transaction composition and what Simon Payton Jones calls “choice”.  Basically, the design can be distilled as follows:

  • Each item of shared state must be stored in a reference
  • References can be read at any point, but they can only be modified within a transaction
  • The data contained within a reference must itself be immutable, the reference simply allows you to switch the data it contains
  • Within a transaction, reading a reference does not return its current value, but rather the value it had precisely when the transaction began.  This allows data to change outside the transaction without disrupting its internal processing.
  • Changes made to a reference inside a transaction are not world-visible until the transaction completes and is committed.  Validation ensures that no data is lost during the commit.
  • Transactions must never contain side-effects as they may be executed multiple times

With all that in mind, let’s get to work on an implementation!

References

Since these are the primitive building blocks of our STM, it seems logical that we should start here.  At an extremely basic level, we will need a structure which looks something like the following:

class Ref[T](private var value: T) {
  def get = value
 
  def :=(value: T) {
    this.value = value
  }
}

This is the basic idea anyway.  We want to be able to retrieve a value from a reference using get, and we want to be able to store a value in the reference using the := operator (hat tip to all of you rabid Pascal fan-boys).  Unfortunately, we haven’t really accomplished anything here.  Yes, we have a reference to wrap around a mutable piece of data, but there are no associated concurrency semantics.  Remember the definition of a reference?  We must only be able to write to it within a transaction.  Furthermore, the value returned from a reference within a transaction is not necessarily its current, world-visible state, but rather the data it had the moment the transaction began.

In order to accommodate these requirements, we will introduce the concept of a context.  Both read and write access to a reference will require a context to be present.  We will have one context for the entirety of world-visible state.  Additionally, each transaction will have its own context.  In that way, we can ensure that transaction modifications are kept local to itself prior to commit while at the same time preventing changes from other transactions from becoming visible after the transaction has started (potentially leading to data integrity problems).  Our API has now evolved to something more like the following:

class Ref[T](private var value: T) {
  // ???
 
  def get(c: Context) = c.retrieve(this)
 
  def :=(value: T)(c: Context) {
    c.store(this)(value)
  }
}

I left the constructor undefined because of the fairly obvious problem in this implementation: how does the Context get the values in the first place?  For the moment, let’s put that problem aside and deal with a more interesting one: limiting write access to within transactions.  Recall that we’re going to have two different kinds of contexts: a live context which is global, as well as a context which is local to the transaction.  The requirement alone implies a way to statically restrict reference mutability to within transactions: require a type of context other than the live one.  To this end, we will derive the following closed hierarchy.  It is closed because there will be no other Context implementations, preventing over-zealous API extensions from fouling up our semantics.

image

We will define Context to be an abstract class and LiveContext to be a singleton object.  Each transaction will have its own Transaction context which it will use in both read and write operations.  LiveContext will only be used outside of a transaction, when there is no other context available.  To enforce this, we will restrict the type of the Context taken by the reference assignment operator to only accept Transaction:

def :=(value: T)(c: Transaction) {
  c.store(this)(value)
}

With this in mind, we can start to envision what the syntax for reference operations will look like:

def transfer(amount: Int, a: Ref[Int], b: Ref[Int])(c: Transaction) {
  a.:=(a.get(c) - amount)(c)
  b.:=(b.get(c) + amount)(c)
}
 
// ...
val accountA = new Ref[Int](1500)
val accountB = new Ref[Int](200)
 
// somehow call `transfer` as a transaction
println("Account A: " + accountA.get(LiveContext))
println("Account B: " + accountB.get(LiveContext))

And just when our design was looking so nice too.  We’ve succeeded in preventing at compile time the modification of references outside of transactions, but we did it at the cost of a tyrannical syntax.  Fortunately, Scala has a handy mechanism for cleaning up syntax such as this, one which should reduce the volume of annoying bulk by several orders of magnitude: implicits.

IScala makes it possible to mark parameters as accepting implicit values.  These parameters in turn become implicit values themselves.  When a method which accepts an implicit parameter of a specific type is called with an implicit value of the same type in scope, the parameter can be omitted entirely.  Thus, by marking LiveContext as an implicit object and appropriately annotating the last parameter of transfer as well as the Context parameters of Ref’s accessor and mutator, we can eliminate almost all of the annoying bulk in the above example:

implicit object LiveContext extends Context {
  ...
}
 
class Ref[T](value: T) {
  // ???
 
  def get(implicit c: Context) = c.retrieve(this)
 
  def :=(value: T)(implicit c: Transfer) {
    c.store(this)(value)
  }
}

With these extra modifiers in place, we can redo our transfer snippet to see how things look.  To cut down on line length, we will also assume that accountA and accountB are references in some sort of global scope:

val accountA = new Ref[Int](1500)
val accountB = new Ref[Int](200)
 
def transfer(amount: Int)(implicit t: Transaction) {
  accountA := accountA.get - amount
  accountB := accountB.get + amount
}
 
// somehow call `transfer` as a transaction
println("Account A: " + accountA.get)
println("Account B: " + accountB.get)

Pretty slick, and we aren’t even finished yet!  We can add an implicit conversion from Ref[T] to T, eliminating the need to call get anywhere in the code:

implicit def refToValue[T](ref: Ref[T])(implicit c: Context) = {
  ref.get(c)
}

…and the final example syntax:

val accountA = new Ref[Int](1500)
val accountB = new Ref[Int](200)
 
def transfer(amount: Int)(implicit t: Transaction) {
  accountA := accountA - amount
  accountB := accountB + amount
}
 
// somehow call `transfer` as a transaction
println("Account A: " + accountA)
println("Account B: " + accountB)

With the exception of the := syntax (which is unfortunately unavoidable), you would never be able to tell that references are being used rather than conventional vars.  Even better, we have managed to preserve our static assurance that no reference may be modified outside of a transaction.  If we were to attempt to call the := method without an instance of Transaction on-hand, the Scala type checker would complain and our code would not compile.  The we need to do to close the loop is to make sure that the only place a Transaction instance can be obtained is inside a transaction (seems logical).

Atomic

For the moment, let’s shelve all of the implementation requirements for the STM and instead focus on the API.  We already have an almost elegant syntax for references, but we’re still missing one final piece: initiating the transaction.  Being incredibly imaginative and innately creative, I decided the best way to devise an API for this would be to hit Wikipedia.  After all, why come up with something original when someone smarter has already solved the problem?

 // Insert a node into a doubly-linked list atomically
 atomic {
     newNode->prev = node;
     newNode->next = node->next;
     node->next->prev = newNode;
     node->next = newNode;
 }

 atomic (queueSize > 0) {
     remove item from queue and use it
 }

Credit: Software transactional memory # Proposed language support

It’s a fairly straightforward API, one which could be expressed in Scala as it stands if we really wanted to.  Obviously the body of the first transaction is a little closer to C than Scala, but we can just take that as example pseudo-code and move on.  The real core of the API is a single method, atomic, which takes a function value and executes that function as a transaction.  If a conditional is provided, it acts as a guard.  If the guard ever returns false, the transaction aborts and is not retried.

The only problem here is we haven’t accounted for our implicit Transaction parameter.  Intuitively, we could just add a parameter of type Transaction to the function value, but unfortunately Scala doesn’t allow anonymous functions with implicit parameters.  That leaves one of two options: either take the Transaction as a parameter to the anonymous function and then store it in an implicit value within the function; or pass an actual method which takes an implicit parameter.  In truth, either approach will work, but for the remainder of the article I will use the method-passing approach, rather than the separate assignment within the anonymous method.

Altogether, our API in action looks something like this:

val accountA = new Ref[Int](1500)
val accountB = new Ref[Int](200)
 
def transfer(amount: Int)(implicit t: Transaction) {
  accountA := accountA - amount
  accountB := accountB + amount
}
 
atomic(transfer(500)(_))    // run transaction
 
println("Account A: " + accountA)
println("Account B: " + accountB)

Notice the extra (_) syntax in the call to atomic.  (thanks, Jesper!)  This is required because transfer accepts an implicit parameter.  Without it, Scala doesn’t know whether we mean to call the method using some implicit value in scope or if we want the function value itself.

Remember this API.  We’ll come back to the implementation of this method after we have completed the rest of the framework.  For now, let’s move onto context…

Context

Returning to the dark and mysterious internals of the implementation, we now come to the deep morass of Context.  As it turns out, everything in the STM will revolve around this class and its two separate implementations.  Recall that it is responsible for retrieving reference values, controlling what data is visible from within a transaction and what data is live.  Generally speaking, we are going to have the following design with respect to where reference data is handled:

  • If in LiveContext, access the One True copy of the data.
  • If in Transaction, access the One True copy if and only if there is no transaction-local version.  Once the reference has been read, cache the data and return that value from all future reads within the transaction.  We’re cheating a bit here since we aren’t taking a snapshot of the world on transaction start, we’re waiting for the first reference read.  As long as we get our validation right, this should be ok.

Logically, this is a simple default operation clouded by a fairly substantive special case (within a transaction after the reference has been read or written once).  To handle the special case, it only makes sense that we use a map from reference to the data to which it corresponds in that transaction.  Remember that the data within a transaction may be a bit behind the live copy.

We could use another map for the general case, but we can make things even simpler than that.  Rather than having a global map from reference to value, we can just store the live values within the Ref objects themselves.  They will still need to delegate to their context to retrieve and store that value, but the actual live version can be kept locally.  This simplifies LiveContext tremendously:

class Ref[T](private[stm] var value: T) {
  def get(implicit c: Context) = c.retrieve(this)
 
  def :=(value: T)(implicit t: Transaction) {
    t.store(this)(value)
  }
}
 
// ...
sealed abstract class Context {
  def retrieve[T](ref: Ref[T]): T
}
 
implicit object LiveContext extends Context {
  def retrieve[T](ref: Ref[T]) = ref.value
}

In this iteration, we have simplified Context a little bit, seeing as it is never actually used for storage.  Notice that all we need to do in LiveContext is delegate right back to the reference.  This may seem like a bit of superfluous indirection but it becomes absolutely essential once we start considering Transaction:

import collection._
 
final class Transaction private[stm] () extends Context {
  private val world = mutable.Map[Ref[Any], Any]()
 
  def retrieve[T](ref: Ref[T]) = {
    val castRef = ref.asInstanceOf[Ref[Any]]
 
    if (!world.contains(castRef)) {
      world(castRef) = ref.value
    }
 
 
 
 
 
    world(castRef).asInstanceOf[T]
  }
 
  def store[T](ref: Ref[T])(value: T) {
    val castRef = ref.asInstanceOf[Ref[Any]]
    world(castRef) = value
  }
}

This is the reason we needed to redirect through retrieve rather than just grabbing the value directly in Ref.  Within a transaction, any dereferencing will polymorphically come to an instance of this class.  The mutable map, world, handles the transaction-local cache of all values once they have been accessed.  Thus, the reference can change after we have looked at it (when another transaction commits) and it doesn’t affect the values local to our transaction.  This technique is exceedingly powerful and in no small part responsible for the higher throughput made possible by the transactional model.

Incidentally, it is worth noting that within a transaction, references are not thread safe.  Thus, if you start a transaction and then start manipulating references concurrently within that same transaction, bad things will happen.  This isn’t really a problem though because transactions are always designed to be single-threaded from start to finish.  They are used in multi-threaded situations, they do not use multiple threads.

Commitment

Now that we have the basic nuts and bolts of our STM framework, we need to start considering how we are going to commit transactions.  The process is two fold: first, we validate all references either read or written to by the transaction, checking for anything which may have changed in the interim; and second, we copy all new data from our transaction-local cache into the live references.  For the sake of simplicity, we will only allow one transaction to commit at a time.  We could do a little better, but this should work just fine for the experimental stuff.

Validation is a toughy, but the second commit step is fairly easy to satisfy: just loop through a set of all writes and copy the changes into the corresponding Ref.  The necessary changes are as follows:

final class Transaction private[stm] () extends Context {
  private val world = mutable.Map[Ref[Any], Any]()
  private val writes = mutable.Set[Ref[Any]]()
 
  ...
 
  def store[T](ref: Ref[T])(v: T) {
    val castRef = ref.asInstanceOf[Ref[Any]]
 
    world(castRef) = v
    writes += castRef
  }
 
  def commit() = {
    CommitLock.synchronized {
      // TODO  validate
 
      for (ref <- writes) {
        ref.value = world(ref)
      }
    }
 
    true
  }
}
 
private[stm] object CommitLock

The commit method returns a Boolean value indicating whether or not the commit was successful.  If a conflict was detected the method returns false and (presumably) the transaction will be tried again.  Since we aren’t doing any validation just yet, we will always return true.  Note that we don’t really have to lock anything related to the Ref instances in order to write their data.  It is possible for another thread to be reading data from these same references at precisely the same time as we are writing to them.  However, primitive memory operations are atomic by definition, meaning that we don’t need to worry about data integrity on the level of a reference value.

Validation is actually the most important part of the transaction commit process and quite possibly the most important facet of the entire STM concept.  Without it, there is nothing to prevent data integrity from breaking down and causing problems.  (remember that $500 we lost?)  Unfortunately, our system doesn’t quite have the chops yet to support any sort of transaction validation.

In order to validate a Ref, we need to compare its state to the state it was in at the moment we read from it or wrote to it (meaning into the transaction-local cache).  We can’t just compare values, partially because equals isn’t fast enough, but also because it doesn’t provide a strong enough guarantee about whether or not data has changed.  What we need is some value which we control which indicates deterministically the current state of a Ref and which can be used later to determine if that state has changed.  In short, we need a revision number:

class Ref[T](value: T) {
  private[stm] var contents = (value, 0)    // init revision to 0
 
 
}

The revision numbers have to be controlled statically with thread-safety, so the best place for them should be in the Transaction singleton.  Transaction (the companion object for Transaction the Context) also contains our implicit conversion as well as that enigmatic atomic method that we still have yet to implement.

object Transaction {
  private var rev_ = 1
  private val revLock = new AnyRef
 
  private def rev = revLock.synchronized {
    val back = rev_
    rev_ += 1
    back
  }
 
  ...
}

Empowered by a revision increment system which is guaranteed to produce unique values for each invocation, we can expand upon our concept just a little bit.  Each Transaction will have a unique revision number associated with it (maintained within the rev field).  Assuming a transaction successfully commits, it will not only modify the value of the references in question but also their revision, which it will set to its own number.

This revision system can be used in validation of transaction commit.  Whenever we read or write to a Ref for the first time, we will store its current revision number within the transaction.  When it comes time to commit the transaction, we can loop over our revision map and compare with the actual revision of the reference in question.  If all of the expected revisions match up with reality, the transaction checks out and we can go ahead and commit.  Otherwise, we have to assume that a transaction operating concurrently modified a reference we used and committed after we started our own transaction.  Once this is known, we can’t simply commit over the other transaction’s changes, throwing away all of that money.  Our transaction must be retried from scratch.

Now that we know how to validate, we can finally look at a completed version of commit (and supporting cast):

final class Transaction private[stm] (val rev: Int) extends Context {
  private val world = mutable.Map[Ref[Any], Any]()
  private val writes = mutable.Set[Ref[Any]]()
  private val version = mutable.Map[Ref[Any], Int]()
 
 
  def retrieve[T](ref: Ref[T]) = {
    ...
 
    if (!world.contains(castRef)) {
      ...
 
      if (!version.contains(castRef)) {
        version(castRef) = castRef.contents._2
      }
    }
 
    ...
  }
 
  def store[T](ref: Ref[T])(v: T) {
    ...
 
    if (!version.contains(castRef)) {
      version(castRef) = ref.contents._2
    }
 
    ...
  }
 
  def commit() = {
    CommitLock.synchronized {
      val back = world.foldLeft(true) { (success, tuple) =>
        val (ref, _) = tuple
        success && ref.contents._2 == version(ref)
      }
 
      if (back) {
        for (ref <- writes) {
          ref.contents = (world(ref), rev)
        }
      }
 
      back
    }
  }
}

It’s a lot of code, but all fairly straightforward.  The validation simply bears out our intuition: checking the revisions initially retrieved from the references with their current values.  Here again we are making use of the fact that memory access is atomic.  We don’t need to worry about a revision changing out of sync with a value because both of them are encapsulated by a 2-tuple within the Ref itself.  Meanwhile, the validation can be trusted because of the CommitLock: we don’t need to worry about another transaction committing between our validation and when we actually get around to saving our values.

Atomic 2.0

I said we would come back to this, and here we are!  We never did implement the atomic method, which is a bit of a shame seeing as it is what is responsible for kicking off the entire transactional process.  Not only that, but it creates the Transaction instance, ensures that the transaction gets committed once it has finished and it retries if that commit fails.  Set in code, it looks something like this:

def atomic[A](f: (Transaction)=>A): A = atomic(true)(f)
 
def atomic[A](cond: =>Boolean)(f: (Transaction)=>A) = {
  def attemptTransact(): A = {
    if (cond) {
      val trans = new Transaction(rev)
 
      try {
        val result = f(trans)
 
        if (trans.commit()) result else attemptTransact()
      } catch {
        case _ => attemptTransact()    // if exception, assume conflict and retry
      }
    } else null.asInstanceOf[A]
  }
 
  attemptTransact()
}

The only weird thing here is the use of an internal function to control the transaction dispatch process.  This is necessary because we need to emulate the pattern of a do/while loop without losing the ability to capture a return value.  This is the one minor feature of the transaction API that we have designed which I haven’t already discussed: the ability to return a value from a transaction.  Practically speaking, this isn’t needed too often since the very purpose of a transaction is to modify references, but it is still a pattern worth keeping in hand.

You will notice that we have some generic catch-all exception handling going on here.  Whenever a transaction throws an exception, we assume that it has failed and we try again.  To be honest, I wrestled back and forth with this decision.  After all, if a transaction comes across a NullPointerException on its first try, it’s not likely to do any better the second time around, or the third, or the fourth, or the…  On the other hand, there is a remote but very real possibility that data can briefly get into an inconsistent state within a transaction.  To understand how, consider the following abstract scenario:

  1. Transaction A writes data to reference A and reference B
  2. Transaction B starts after transaction A but before it has committed
  3. Transaction B reads from reference A
  4. Transaction A commits
  5. Transaction B reads from reference B
  6. Transaction B does the funky chicken and tries to commit
  7. Transaction B fails validation and has to try again
  8. Everything works fine on the second time around

Everything gets straightened out in the end due to the validation, but there is still this hairy moment within transaction B that we have to worry about.  Within the transaction, we have some data from before transaction A committed, and some from afterward.  Validation is never going to let this slip by, but while the transaction is still executing we can run into some fairly serious issues.

Imagine for example that reference B is changed in such a way that it throws an exception in transaction B unless it is paired with just the right value of reference A.  Since we have an inconsistent pairing between references A and B within transaction B, we will get an exception caused directly by a breakdown in data integrity.  Because this is a data integrity issue, we want this exception to trigger a redo in the transaction.  However, we can’t really predict such exceptions, so there’s no way we can distinguish between a legitimate exception (which should be propagated outside the transaction) and a data integrity fault.  In short, we’re in trouble.

Given the way we designed the STM framework, I don’t really see a nice, efficient way to avoid this problem.  There are other approaches we could take which don’t have this issue, but that would either require an entirely different implementation or a much smarter developer taking the lead.  I would be interested to see how Clojure handles this case…

A Less-Trivial Example

Now that we have this full STM framework, it might be interesting to put it to work.  To that end, let’s create a simple market simulation with three businesses and a hundred potential customers, each with their own accounts and separate balances.  All of these entities will have their own thread operating on their behalf, making purchases, excepting refunds and keeping the Mafia off their back.  Bear in mind that this is a truly concurrent simulation with full thread semantics.  We’re not going to use actors or anything like that to reduce the number of threads involved, everything will operate in true parallel.

To make things a little more interesting, we will also associate a 7.5 % fee with each transfer, paid to a separate account.  Also, each transfer will be logged presumably for later review.  To make everything fun, we will have one final thread which monitors the entire market, summing up every account and checking the total.  The obvious concern is that some collision in access of the shared data will lead to the unexpected loss (or gain) of wealth.  So long as the total market value remains constant, we can assume that shared state is being handled appropriately.

Let’s start out by defining our accounts and basic structure.  Bear in mind that Account is a type alias (similar to a C-style typedef) for Ref[Long], it is not a separate class.

object BankFrenzy {
  import Transaction._
 
  type Account = Ref[Long]
 
  private val fees = new Account
  private val log: Ref[Vector[Transfer]] = new Ref(EmptyVector)
 
  def main(args: Array[String]) {
    val business1 = new Account(15000)
    val business2 = new Account(20000)
    val business3 = new Account(50000)
 
    val people = (0 until 100).foldLeft(Vector[Account]()) { (vec, i) =>
      vec + new Account(1000)
    }
  }
}

Remember that all data contained within a reference must be completely immutable, otherwise the STM framework cannot help you.  If you can just change the underlying data within a reference at will, then transaction semantics are useless (since revision tracking breaks down).  To that end, we will use my port of Clojure’s persistent vector to handle the transfer log; and just because it’s fun, we will also use it to manage personal accounts.

Moving on, we should probably define some functions to operate on these accounts.  Remember, they must take an implicit parameter of type Transaction, otherwise they will be prevented from modifying references.

def transfer(amount: Long, from: Account, to: Account)(implicit t: Transaction) {
  log := log + Transfer(amount, from, to)
 
  val less = Math.round(amount * 0.075)
 
  from := from - amount
  to := to + (amount - less)
  fees := fees + less
}
 
def sum(portfolio: Vector[Account])(implicit t: Transaction) = {
  portfolio.foldRight(0:Long) { _ + _ }
}

One thing worthy of attention here is the sum method: we aren’t actually modifying any references in this method, so why bother putting it within a transaction?  The answer is to enforce data integrity.  We want to make sure that we see a truly consistent picture of the entire market, and the only way to be absolutely sure of that is to use the conceptual “snapshot of the world” maintained by a transactional log.

Also notice that the sum method does not store its result within a reference, it actually returns a value.  This is one of the neat features of our STM implementation: it allows transactions to return values just like functions.  This dramatically reduces the boilerplate which would normally be required to get the result of a calculation from within a transaction.

With this infrastructure in place, we can go ahead and create all of the threads we’re going to need for the simulation:

val market = people + business1 + business2 + business3 + fees
var running = true
 
val secActor = thread {
  while (running) {
    val total = atomic(sum(market)(_))
    println("Market value: $" + total)
 
    sleep(10)
  }
}
 
val businessActor = thread {
  while (running) {
    atomic(transfer(250, business1, business2)(_))    // transfer rent
 
    sleep(200)
  }
}
 
val peopleActors = for {
  i <- 0 until people.length
  val p = people(i)
} yield thread {
  atomic(transfer(50, p, business3)(_))       // payoff the mob
  atomic(transfer(i * 10, p, business1)(_))   // purchase from business1
  atomic(transfer(i * 3, business2, p)(_))    // refund from business2
}

This is assuming that we have already defined a utility method, thread, in the following fashion:

def thread(f: =>Unit) = new Thread {
  override def run() {
    f
  }
}

Just one of the many little tricks made possible by the staggering power of Scala’s syntax.

There isn’t much worthy of attention within our simulation threads, it’s just a lot of concurrent operations running against some shared state.  If we wanted to really have some fun, we could add println status messages to each thread, allowing us to try the simulation multiple times and watch the thread interleaving change from run to run.  However, all we’re really interested in with this simulation is the assurance that data integrity is maintained at all times.  To see that, all we really need is to check the starting market value, the ending value and some period market auditing while the simulation is in progress:

println("Starting market value: $" + atomic(sum(market)(_)))
 
businessActor.start()
secActor.start()
 
for (pa <- peopleActors) pa.start()
for (pa <- peopleActors) pa.join()
running = false
 
businessActor.join()
secActor.join()
 
println("Total fees: $" + fees)
println("Final market value: $" + atomic(sum(market)(_)))

If we compile and run the simulation, the output will look something like this (on a dual-core, 2 Ghz processor):

  Starting market value: $185000
  Market value: $185000
  Market value: $185000
  Market value: $185000
  Total fees: $5258
  Final market value: $185000

We could try the simulation a hundred times with different CPU loads and even on separate machines, the results will always be the same.  While there may be a greater or lesser number of concurrent market audits during the simulation, the values retrieved each time will be the same.  From this we can conclude one important fact: we have succeeded in designing an STM framework which preserves data integrity.

In simulations like this one with a high degree of contested data, STM may actually be slower than a traditional, fine-grained locking strategy.  However, just think for a moment about trying to write this simulation using Java’s ReentrantReadWriteLock class.  It would be nearly impossible to design such a system, let alone maintain it.  There would always be the danger that we would accidentally get the locking in the wrong order, or forget to lock something before we access it.  In short, such an effort would be extremely hazard prone, and far more verbose.  Using our STM framework, the resulting code was clean and simple to understand.  It’s easy to see why techniques like this are really starting to catch on.

Conclusion

Hopefully this has been an informative and enjoyable foray into the world of transactional memory systems.  I will retroactively apologize for any areas where my facts are in err; I’m certainly quite new to all of these concepts.  As per usual, the library described in this article is available for download.  I was a little lax on the testing side of life, so you may want to vet the library a little bit before you trust it unsupervised in your data center.

There are a number of interesting aspects to STM that I didn’t cover in this article, such as retry and the related function, check (implemented in the framework).  Also left completely untouched is the monadic operation orElse which can be used to compose transactions in a “first try this, then try that on failure” sort of way.

STM is a very active research topic today with a lot of the brightest minds in the industry pondering ways to make it better.  While it certainly doesn’t solve all of the problems associated with concurrency, it does have the potential to simplify locking and produce better performance under some conditions.  Definately a technology to watch as it develops!

Higher-Order Fork/Join Operators

22
Sep
2008

I think we can all agree that concurrency is a problem.  Not really a problem as in “lets get rid of it”, but more the type of problem that really smart people spend their entire lives trying to solve.  Over the years, many different solutions have been proposed, some of them low-level, some more abstract.  However, despite their differences, a common thread runs through all of these ideas: each of them attempts to ease the pain of decomposing operations in a reorderable fashion.

Surprisingly, the word “ordering” is not often heard in conjunction with parallelism.  Most of the time, people are thinking in terms of server/client or broker/investor.  If you really deconstruct the issue though, there is actually a deeper question underlying all concurrency: what operations do not depend upon each other in a sequential fashion?  As soon as we identify these critical operations, we’re one step closer to being able to effectively optimize a particular algorithm with respect to asynchronous processing.

By the way, I really will get to fork/join a little later, but I wanted to be sure that I had laid a solid groundwork for the road ahead.  Without understanding some of the fundamental theory behind fork/join, it will be impossible to see how it can be applied effectively to your next project.

Factorial

One of the odd things about computer science is a depressing lack of imaginative examples.  Not being one to break with tradition, I’ve decided to kick off our little quest with a little time spent in the well-trodden foothills of factorial.  This should help us to establish some terminology (which I’m arbitrarily assigning for the purposes of this article) as well as the basic concepts involved.  A simple recursive implementation (in Scala of course) would look like this:

def fac(n: Int): Int = if (n < 1) 1 else fac(n - 1) * n

For each number, this function performs a number of discrete operations.  First, it checks to see if the index is less than 1.  If so, then the function returns immediately, otherwise it proceeds on a separate course.  This is a branching operation.  Since the “true branch” is uninteresting, we will focus on the case when the index is in fact greater than 1.  In this case, we have three critical operations which must be performed.  They are as follows (temporary names are fictitious):

  • Subtract 1 from n and store the value in some temporary $t1
  • Dispatch to function fac passing the value from $t1
  • Multiply result from dispatch with n and return

All this may seem extremely pedantic but please, bear with me.  Consider these operations very carefully in the topological sense.  What we’re trying to see here is if one (or more) of these operations may be ordered above (or below) one of the others.  For example, could we perhaps dispatch to fac after multiplying and returning?  Or could we perform the subtraction operation after the dispatch?

The answer is quite obviously “of course not”.  There is no way we can change the ordering in this expression because each step depends entirely upon the result from the previous.  As far as our attempts to parallelize are concerned, these three operations are completely atomic, meaning that they form an inseparable computation.

image

Since we’ve drilled down as far as we can possibly go in our implementation and so identified the most atomic computation, let’s move out one step and see if we can find anything with promise.  Stepping back through our execution sequence leads us directly to the branching operation identified previously.  Remember that our goal is to identify operations which can be shuffled around in the execution order without affecting the semantics. (does this feel like pipeline optimization to anyone else?)  Unfortunately, here too we are at an impasse.  We might try moving an atomic computation from one of the branches out before the branching operation, but then we could conceivably do the wrong thing.  Since our function uses recursion, this sort of reordering would be very dangerous indeed. 

The truth is that for factorial, there are absolutely no operations which can be moved around without something going wrong.  Because of this property, we are forced to conclude that the entire factorial operation is atomic, not just its false branch.  Unfortunately, this means that there is no way to effectively transform this function into some sort of asynchronous variant.  That’s not to say that you couldn’t calculate factorial of two separate numbers concurrently, but there is no way to modify this implementation of the factorial function in a parallel fashion1.  This is truly the defining factor of atomic computations: it may be possible to reorder a series of atomic computations, but such a reordering cannot affect the internals of these computations.  Within the “atom”, the order is fixed.

So what does reordering have to do with concurrency?  Everything, as it turns out.  In order to implement an asynchronous algorithm, it is necessary to identify the parts of the algorithm which can be executed in parallel.  In order for one computation to be executed concurrently with another, neither must rely upon the other being at any particular stage in its evaluation.  That is to say, in order to execute computation A at the same time as computation B, the ordering of these two computations must be irrelevant.  Providing that both computations complete prior to some computation C (which presumably depends upon the results of A and B), the aggregate semantics of the algorithm should remain unaffected.  You could prove this, but I really don’t feel like it and frankly, I don’t think anyone reading this will care.  :-)

Fibonacci

Now that we have some simple analysis on factorial under our belt, let’s try something a little tougher.  The Fibonacci series is another of those classic computer science examples.  Curiously enough, the implementation used by every known textbook to explain recursion is actually one of the worst possible ways to implement the calculation.  Wikipedia has an excellent description of why this is, but suffice it to say that the intuitive approach is very, very bad (efficiency wise).

However, the “good” implementations used to calculate the nth number of the Fibonacci series just aren’t as concise or easily recognized.  Also, they’re fairly efficient in their own rights and thus see far less benefit from parallelization at the end of the day.  So rather than taking the high road, we’re going to just bull straight ahead and use the first algorithm which comes to mind:

def fib(n: Int): Int = if (n < 2) n else fib(n - 1) + fib(n - 2)

Like factorial, this function makes an excellent poster child for the syntactic wonders of functional programming.  Despite its big-O properties, one cannot help but stop and appreciate the concise beauty of this single line of code.

As is common in simple recursion, our function begins with a conditional.  We have a simple branching operation testing once again for a range (n < 2), with a base case returning n directly.  It is easy to see how the “true branch” is atomic as it consists of but one operation.  We’ve already made a hand-wavy argument that branches themselves should not be dissected and moved around outside of the conditional, so it would seem that our only hope rests with the recursive “false branch”.  In words, we have the following operations:

  • Subtract 1 from n and store the value in temporary $t1
  • Dispatch to function fib passing the value from $t1; store the value in $t1
  • Subtract 2 from n and store the value in temporary $t2
  • Dispatch to function fib passing the value from $t2; store the value in $t2
  • Add values $t1 and $t2 and return

Ah, this looks promising!  We have two “blocks” of operations which look almost identical.  Printed redundancy should always be a red flag to developers, regardless of the form.  Printed redundancy should always be a red flag to developers, regardless of the form.  In this case though, we don’t want to extract the duplicate functionality into a separate function, that would be absurd.  Rather, we need to observe something about these two operations, specifically: they do not depend on one-another.  It doesn’t matter whether or not we have already computed the value of fib(n - 1), we can still go ahead and compute fib(n - 2) and the result will be exactly the same.  We’re going to get into trouble again as soon as we get to the addition operation, but as long as both dispatches occur before the final aggregation of results, we should be in the clear!

image

Because it does not matter in which order these computations occur, we are able to safely parallelize without fear of subtle semantic errors cropping up at unexpected (and of course, unrepeatable) full-board demonstrations.  Armed with the assurance which only comes from heady, unrealistic trivial algorithm analysis, we can start planning our attack.

Threads Considered Insane

Being a good Java developer (despite the fact that we’re using Scala), the very first thing which should come to mind when thinking of concurrency is the concept of a “thread”.  I’m not going to go into any detail as to what threads are or how they work since they really are concurrency 101.  Suffice it to say though that threads are the absolute lowest-level mechanism we could possibly use (at least on this platform).  Here we are, Fibonacci a-la Thread:

def fib(n: Int): Int = {
  if (n < 2) n else {
    var t1 = 0
    var t2 = 0
 
    val thread1 = new Thread {
      override def run() {
        t1 = fib(n - 1)
      }
    }
 
    val thread2 = new Thread {
      override def run() {
        t2 = fib(n - 2)
      }
    }
 
    thread1.start()
    thread2.start()
 
    thread1.join()
    thread2.join()
 
    t1 + t2
  }
}

I can’t even begin to count all of the things that are wrong with this code.  For starters, it’s ugly.  Gone is that attractive one-liner that compelled us to pause and marvel.  In its place we have a 25 line monster with no apparent virtues.  The intent of the algorithm has been completely obscured, lost in a maze of ceremony.  But the worst flaw of all is the fact that this design will actually require (n - 2)! threads.  So to calculate the 10th Fibonacci number, we will need to create, start and destroy 40,320 Thread instances!  That is a truly frightening value.

At first blush, it seems that we can alleviate at least some of the insanity by using a thread pool.  After all, can’t we just reuse some of these threads rather than throwing them away each time?  Unfortunately, this well-intentioned approach doesn’t quite suffice.  It turns out that we can’t really pool very many threads due to the fact that we’re utilizing a thread in fib to recursively call itself and then wait for the result.  Thus, the “parent” dispatch is still holding a resource when the “child” attempts to obtain an allocation.  Granted, we have reduced the number of required threads to a mere 2n - 4, but with a fixed size thread pool (the most common configuration), we’re still going to run into starvation almost immediately.  Apocalisp has a more in-depth article explaining why this is the case.

Something a Little Better…

For the moment, it looks like we have run into an insurmountable obstacle.  Rather than mash our brains out trying to come up with a solution, let’s move on and conceptualize how we might want things to work, at least in syntax.

Clearly threads are not the answer.  A better approach might be to deal with computational values using indirection.  If we could somehow kick off a task asynchronously and then keep a “pointer” to the final result of that task (which has not yet completed), we could later come back to that result and retrieve it, blocking only if necessary.  It just so happens that the Java 5 Concurrency API introduced a series of classes which fulfill this wish:  (what a coincidence!)

def fib(n: Int): Future[Int] = {
  if (n < 2) future(n) else {
    val t1 = future {
      fib(n - 1).get()
    }
 
    val t2 = future {
      fib(n - 2).get()
    }
 
    future {
      t1.get() + t2.get()
    }
  }
}
 
def future[A](f: =>A) = exec.submit(new Callable[A] {
  def call = f
})

I’m assuming that a variable called exec is defined within the enclosing scope and is of type ExecutorService.  The helper method is just syntax, the real essence of the example is what we’re doing with Future.  You’ll notice that this is much shorter than our threaded version.  It still bears a passing resemblance to that horrific creature of yesteryear, but yet remains far enough removed as to be legible.  We still have our issue of thread starvation to content with, but at least the syntax is getting better.

Along those lines, we should begin to notice a pattern emerging from the chaos: in both implementations so far we have started by asynchronously computing two values which are assigned to their respective variables, we then block and then merge the result via addition.  Do you see the commonality?  We start by forking our reorderable computations and finish by joining the results according to some function.  This right here is the very essence of fork/join.  If you understand this one concept, then everything else falls into place.

Now that we have identified a common pattern, we can work to make it more syntactically palatable.  If indeed fork/join is all about merging asynchronous computations based on a given function, then we can invent a bit of syntax sugar which should make the Fibonacci function more concise and more readable.  To differentiate ourselves from Future, we will call our result “Promise” (catchy, ain’t it?).

def fib(n: Int): Promise[Int] = {
  if (n < 2) promise(n) else {
    { (_:Int) + (_:Int) } =<< fib(n - 1) =<< fib(n - 2)
  }
}

At first glance, it seems that all we have done is reduce a formerly-comprehensible series of verbose constructs to a very concise (but unreadable) equivalent.  We can still make out our recursive calls as well as the construction of the base case, but our comprehension stops there.  Perhaps this would be a bit more understandable:

val add = { (a: Int, b: Int) => a + b }
 
def fib(n: Int): Promise[Int] = {
  if (n < 2) promise(n) else {
    add =<< fib(n - 1) =<< fib(n - 2)
  }
}

The only reason to use an anonymous method assigned to a value (add) rather than a first-class method is the Scala compiler treats the two differently in subtle ways.  Technically, I could use a method and arrive at the same semantic outcome, but we would need a little more syntax to make it happen (specifically, an underscore).

This should be starting to make some more sense.  What we have here is a literal expression of fork/join: given a function which can join two integers, fork a concurrent “process” (not in the literal sense) for each argument and reduce.  The final result of the expression is a new instance of Promise.  As with Future, this operation is non-blocking and very fast.  Since the arguments themselves are passed in as instances of Promise, we literally don’t need to wait for anything.  We have now successfully transformed our original fib function into a non-blocking version.  The only thing left is a little bit of syntax to “unwrap” the final result:

val num = fib(20)
num()                  // 6765

Incidentally, the =<< operator was not chosen arbitrarily, its resemblance to the “bind” operator in Haskell is quite intentional.  That is not to say that the operation itself is monadic in any sense of the word, but it does bear a conceptual relation to the idea of “binding multiple operations together”.  The operator is inverted because the bind operation is effectively happening in reverse.  Rather than starting with a monad value and then successively binding with other monads and finally mapping on the tail value (as Scala does it), we are starting with the map function and then working our way “backwards” from the tail to the head (as it were).  None of the monadic laws apply, but this particular concurrency abstraction should tickle the same region of your brain.

An End to Starvation

I half-promised a little while back that we would eventually solve the issue of thread starvation in the implementation.  As mentioned, this particular issue was the central focus of an article on Apocalisp a few weeks back.  For full details, I will again refer you to the original.  In a nutshell though, it looks like this:

  • Operation A dispatches operations B and C, instructing them to send the result back to A once they are finished
  • Operation A releases the thread
  • Operation B executes and sends the result back to A
  • Operation C executes and sends the result back to A
  • Operation A combines the two results and sends the final result back to whoever dispatched it
  • …and so on, recursively

Rather than stopping the world (or at least, our little thread) while we wait for a sub-operation to complete, we just tell it to give us the result as soon as it’s done and we move on.  The whole thing is based around the idea of asynchronous message passing.  The first person to say “actors” gets a gold star.

Every Promise is an actor, capable of evaluating its calculation and sending the result wherever we need it.  The =<< builds up a “partially-applied asynchronous function” based on the original function value we specified (add), binding each Promise in turn to a successive argument (a nice side-benefit of this is compile-time type checking for argument binding).  Once the final argument is bound, a full-fledged Promise emerges with the ability to receive result messages from the argument Promise(s).  Once every value is available, the results are aggregated in a single collection and then passed in order to the function.  The final result is returned and subsequently passed back to any pending actors.  It’s a classic actor pattern actually: don’t block, just tell someone else to call you as soon as they are ready.

With this strategy, it is actually possible to execute the whole shebang in a single thread!  This is because we never actually need to be executing anything in parallel, everything is based on the queuing of messages.  Of course, a single-threaded execution model would completely ruin the entire exercise, so we will just trust that Scala’s actor library will choose the correct size for its internal thread pool and distribute tasks accordingly.

Conclusion

In case you hadn’t already guessed, I’ve actually gone and implemented this idea.  What I have presented here is a bit of a distillation of the “long think” I’ve had regarding this concept and how it could be done.  The only important item that I’ve left out is what Doug Lea calls the “granularity control”.  Basically, it’s a threshold which describes the point at which the benefits of executing a task asynchronously (using fork/join) are outweighed by the overhead involved.  This threshold can be seen in my benchmark of the library.  Performance numbers look something like this (on a dual-core, 2 Ghz 32-bit processor):

Calculate fib(45)
Sequential Time 14682.403901 ms
Parallel Time 7515.882423 ms
Sequential Memory 0.023438 KB
Parallel Memory 3131.548828 KB

For the mathematically challenged, the results show that the parallel execution using Promise was 95.351698% faster than the same operation run sequentially.  That’s almost linear with the number of cores!  Accounting for the overhead imposed by actors, I would expect that the impact on performance would approach linearity as the number of cores increases.

Fork/join isn’t the answer to the worlds concurrency problems, but it certainly is a step in the right direction.  Also, it’s hardly a new approach, but it has generally remained shrouded in a murky cloud of academic trappings and formalisms.  As time progresses and our industry-wide quest for better concurrency becomes all the more urgent, I hope that we will begin to see more experimentation into improving the outward appearance of this powerful design pattern.

More Persistent Vectors: Performance Analysis

1
Sep
2008

In my previous post, I introduced the concept of "persistent vectors" and walked through one implementation of the idea.  When I actually pushed the post out, I was pretty happy with my code, but it seems I still have much to learn.  :-)  A number of very smart people replied, suggesting ways that the implementation could be cleaned up and improved.  Among these intrepid commenters was David MacIver, who correctly pointed out the similarities between my persistent vector and his IntMap class (coming in Scala 2.7.2).  Needless to say, my interest was piqued, so over the course of last week, I spent a fair amount of time going over his implementation as well as the implementations proposed by researchers in the past.

I also took the time to create a proper performance test suite for my vector implementation, one which could compare with other conceptually-similar implementations in a controlled and repeatable environment.  The results were both interesting and instructive, so I figured I would take the time to share them here.

Essentially, the performance suite runs through six tests, each of which designed to illuminate either a strength or a weakness in the underlying implementation.  These tests are run against four different classes:

  • Vector (my persistent implementation)
  • ArrayBuffer (from scala.collection.mutable)
  • IntMap (David MacIver)
  • Map

The addition of the last test target was more curiosity than anything else.  I wanted to see just how improved was IntMap over Map for integer keys.  The results turned out to be rather surprising:

  Time Memory
  Vector ArrayBuffer IntMap Map Vector ArrayBuffer IntMap Map
Fill Sequential 190.51ms 15.39ms 37.15ms 115.14ms 67.11 MB 3.93 MB 22.29 MB 12 MB
Fill Random 392.98ms 2028.43ms 128.35ms 103.3ms 64.97 MB 513.59 MB 39.89 MB 10.94 MB
Read Sequential 28.01ms 3.83ms 23.21ms 35.24ms 6.67 MB 0.02 KB 0.02 KB 3.37 MB
Read Random 92.8ms 11.14ms 54.81ms 63.8ms 8.01 MB 0.02 KB 0.02 KB 2.01 MB
Reverse 0.02ms 0.01ms - - 0.09 KB 0.04 KB - -
Compute Length 0.01ms 0.01ms 5.11ms 0.3ms 0.02 KB 0.02 KB 0.02 KB 2.23 MB

As you can see, IntMap triumphed over the other immutable data structures (including Vector) at just about every turn.  To understand why this is surprising, we need to spend a little time examining the theoretical properties of the two primary implementations: IntMap and Vector.

Patricia Tries

I already spent a fair bit of time explaining the concept of partitioned tries in the previous article, so I’m not going to reiterate all of that information here.  In a nutshell, the implementation of Vector is based upon the concept of a trie with an extremely high branching factor where the path to each node encodes its index.  Unlike List, the data structure is not fully persistent, meaning that some data copying must take place upon insert.  Specifically, a new array of branches must be allocated for the specific parent node of the inserted value and so on recursively on to the root.  The advantage to this partially-persistent implementation is that we can take advantage of the constant access time afforded by the use of arrays under the surface.  The unfortunate truth is that fully persistent data structures do not offer constant access time (at least, none that I know of), and thus are generally unsuitable for implementing random-access vectors.

The idea for this implementation comes originally from Phil Bagwell (interestingly enough, a researcher at LAMP, Martin Odersky’s research department at EPFL) in a paper entitled "Ideal Hash Trees".  His original concept though was for a more efficient hash table data structure, not necessarily with immutability as a requirement.  This concept was then adapted by Rich Hickey for his language, Clojure.  Finally, I expanded upon Clojure’s persistent vectors somewhat by changing their trie distribution from little- to big-endian and wrote up the result in Scala.  There are some other minor differences between Hickey’s design and my own, but the data structures are essentially identical.

Like my Vector implementation, the idea for IntMap began its life as a research paper, this time by Chris Okaski and Andrew Gill.  This paper is quite an interesting read if you’ve got a spare afternoon, although it does make use of SML rather than Scala as a base language.  In summary, the idea was to create an efficient, persistent map structure for integer keys.  Superficially, this sounds quite similar to Hickey’s modification of Bagwell’s concept, but there are many important distinctions below the surface.

At the extremely lowest level, IntMap actually makes use of a structure known as a "Patricia trie" with a fixed branching factor of two.  Much like Vector, IntMap encodes the key (index) of the data node within its path.  This encoding is based on the individual bits of the index.  However, unlike Vector, the ordering is little-endian.  Also, to avoid needlessly growing trees to absurd depths, linear common sub-keys are merged into a single prefix node.  This is what differentiates Patricia tries.  To illustrate, if our branching factor were 10 and we were to store at indexes 1234 and 2234, the common prefix "234" would be represented in a single node, rather than three separate nodes trivially linked to one-another.

This use of a low branching factor in the trie is extremely useful when performing insertions.  Specifically, more of the trie structure is preserved untouched from one modification to another.  Literally, IntMap is more persistent than Vector.  While this is great for writes, it does make read times a little less efficient.  Specifically, IntMap reads are worst-case O( log2(k) ), where k is the index in question.  For random data input, the average case is reduced somewhat by the prefix collapsing, but this should not be too significant.

By contrast, Vector uses an extremely high branching factor (by default) and so offers read efficiency which is O( logb(k) ), where k is the index and b is the branching factor.  Due to the practical limitations imposed on integer length, this translates into an upper-bound of O(7), which is (for all intents and purposes) constant.  Unfortunately, this analysis does not seem to be born-out by the performance data.

Possible Explanation

The only answer I can think of to explain the disparity between IntMap and Vector (both in time and in space utilization) is the use of a List[Int] in Vector to find the target node in the data structure.  This List is required primarily because I wanted the data distribution in the trie to be optimized for sequential access, therefore requiring the trie encoding to be big-endian on the index rather than little-endian.  The unfortunate truth is there’s no clean mathematical method (that I know of) which would allow the deconstruction of a number based on its most significant value in an arbitrary base.  In fact, the implementation of computePath (as suggested by Jules) actually cheats and makes use of the fact that persistent List(s) are constructed from the tail-end:

@inline
def computePath(total: Int, base: Int) = {
  if (total < 0) {
    throw new IndexOutOfBoundsException(total.toString)
  } else {
    var back: List[Int] = Nil
    var num = total
 
    do {
      back = (num % base) :: back
      num /= base
    } while (num > 0)
 
    back
  }
}

As efficient as this method is on most modern processors, it can never be as fast as a simple bit-masking operation.  Not only that, but it requires the creation of massive numbers of small, immutable objects (cons cells).  I believe that it is this instantiation overhead which is eating up the extra memory on reads and killing the overall performance.

Unfortunately, I can’t seem to conceive a better way of doing big-endian data distribution.  So are there any clever mathy people out there who have a brilliant way of deconstructing the index head-first rather than from the tail end?  If I could do that, then I could remove the List entirely from the implementation and rely instead on in-place calculations.  Maybe then I could catch up with David’s blisteringly fast IntMap:-)

Implementing Persistent Vectors in Scala

26
Aug
2008

Oh yeah, we’re really getting into Digg-friendly titles now, aren’t we?

The topic of persistent vectors is one of those odd backwaters of functional programming that few dare to approach.  The basic idea behind it all is to try to devise an immutable data structure which has the performance characteristics of a mutable vector.  What this means for practical application is that you shouldn’t have to deal with O(n) efficiency on random access like you do with List(s).  Instead, accessing arbitrary indexes should be constant time (O(1)), as should computing its length.  Additionally, modifying an arbitrary index should be reasonably efficient - as close to constant-time as possible.

Of course, the word “modifying” is a trifle inaccurate as it implies a mutable data structure.  One of the primary qualifications for a purely functional vector is that it is completely immutable.  Any changes to the vector result in the creation of a new vector, rather than modifying the old.  Basically, it’s an immutable data structure just like any other, but one which retains the brutal efficiency of its mutable counterpart.

Unfortunately, this turns out to be a very tough nut to crack.  A number of researchers have attempted different strategies for solving the problem, none of which have been entirely successful.  Rich Hickey, the creator of Clojure, has a brilliant presentation that essentially describes the solution I have chosen.  For the impatient, the good stuff starts at about 36:00 and lasts for about ten minutes.  I’ll elaborate on the problems with functional vectors a bit more in a second, but first a bit of motivational propaganda…

Thou Shalt Not Mutate

There is a single principle which should be drilled into skulls of all programmers everywhere: mutable data structures are bad news.  Don’t get me wrong, I love ArrayList as much as the Java addict, but such structures cause serious problems, particularly where concurrency is concerned.  We can consider the trivial example where two threads are attempting to populate an array simultaneously:

private String[] names = new String[6];
private int index = 0;
 
public static void main(String[] args) {
    Thread thread1 = new Thread() {
        public void run() {
            names[index++] = "Daniel";
            names[index++] = "Chris";
            names[index++] = "Joseph";
        }
    };
 
    Thread thread2 = new Thread() {
        public void run() {
            names[index++] = "Renee";
            names[index++] = "Bethany";
            names[index++] = "Grace";
        }
    };
 
    thread1.start();
    thread2.start();
 
    thread1.join();
    thread2.join();
 
    for (String name : names) {
        System.out.println(name);
    }
}

What does this snippet print?  I don’t know.  It’s actually indeterminate.  Now we can guess that on most machines the result will be essentially interleaved between the two threads, but there is no way to guarantee this.  Part of the reason for this is the fact that arrays are mutable.  As such, they enable (and indeed, encourage) certain patterns which are highly destructive when employed asynchronously.

However, concurrency is not even the only motivation for immutable data structures.  Consider the following example:

List<String> names = ...
for (String name : names) {
    names.add(name.toUpperCase());
}

I’m sure all of us have done something like this, most likely by accident.  The result is (of course) a ConcurrentModificationException caused by the fact that we are attempting to add to a List while we are iterating over its contents.  I know that the first time I was faced with this error message I became extremely confused.  After all, no threads are being employed, so why is this a problem?

Iterators are extremely dependent on the internal state of their data structure.  Anyone who has ever implemented an iterator for a linked list or (even better) a tree will attest to this fact.  This means that generally speaking, there is no way for an iterator to guarantee correctness if that structure is changing out from underneath it (so to speak).  Things may be fine in a linear structure like a list, but as soon as you get into anything non-linear like a tree or a hash table it becomes difficult to even define what the “correct” behavior should be.  Think about it; should the iterator backtrack to hit the missing elements?  Should this backtracking include elements which have already been consumed?  What if the order changes dramatically and pre-consumed elements are now ahead of the current index?  There are a whole host of problems associated with iterating over mutable data structures, and so rather than vainly attempt to solve these issues in a sane and consistent manner, JDK collections simply throw an exception.

All of this becomes moot once you start using immutable data structures.  There is no way to modify a structure while iterating over it because there is no way to modify the structure at all!  Concurrency is not an issue because without any mutable state to require locking, every thread can operate simultaneously on the structure.  Not only is it thread safe, but it is unsynchronized and thread safe.  Immutable data structures retain all of the asynchronous throughput of non-locking implementations without any of the race conditions and non-determinacy which usually results.

A Brief Note on Idioms

At this point, the question you must be asking yourself is: “So if the data structure is immutable, what good is it?”  The answer is “for reading”.  Data structures spend most of their lives being read and traversed by other code.  Immutable data structures can be read in exactly the same fashion as mutable ones.  The trick of course is constructing that data structure in the first place.  After all, if the data structure is completely immutable, where does it come from?  A simple example from a prior article is sufficient to demonstrate both aspects:

def toSet[T](list: List[T]): Set[T] = list match {
  case hd :: tail => hd + toSet(tail)
  case Nil => Set[T]()
}

This is neither the most concise nor the most efficient way to accomplish this task.  The only purpose served by this example is to illustrate that it is very possible to build up immutable data structures without undue syntactic overhead.  You’ll notice that every time we want to “change” a data structure - either removing from the list or adding to the set - we use a function call and either pass or return the modified structure.  In essence, the state is kept entirely on the stack, with each new version of the data structure in question becoming the “changed” version from the previous operation.

This idiom is actually quite powerful and can be applied to even more esoteric (and less uniformly iterative) operations.  As long as you are willing to let execution jump from one function to the next, it becomes extremely natural to deal with such immutability.  In fact, you start to think of immutable data structures as if they were in fact mutable, simply due to the fact that you are idiomatically “modifying” them at each step.  Note that this pattern of modifying data between functions is critical to actor-based programming and any seriously concurrent application design.

Problems Abound

For the sake of argument, let’s assume that my pitiful arguments have convinced you to lay aside your heathen mutating ways and follow the path of functional enlightenment.  Let’s also assume that you’re consumed with the desire to create an application which tracks the status of an arbitrary number of buttons.  These buttons may be pressed in any order regardless of what other buttons are already pressed.  Following the path of immutability and making use of the patron saint of persistent data structures (List), you might come up with the following solution:

class ButtonStrip private (buttons: List[Boolean]) {
  def this(num: Int) = {
    this((0 until num).foldLeft(List[Boolean]()) { (list, i) =>
      false :: list
    })
  }
 
  def status(index: Int) = buttons(index)
 
  def push(index: Int) = modify(index, true)
 
  def unpush(index: Int) = modify(index, false)
 
  /**
   * Modify buttons list and return a new ButtonStrip with the results.
   */
  private def modify(index: Int, status: Boolean) = {
    val (_, back) = (buttons :\ (buttons.length - 1, List[Boolean]())) { (tuple, button) =>
      val (i, total) = tuple
      (if (i == index) status else button) :: total
    }
    new ButtonStrip(back)
  }
}

This is a horrible mess of barely-idiomatic functional code.  It’s difficult to read and nearly impossible to maintain; but it’s purely immutable!  This is not how you want your code to look.  In fact, this is an excellent example of what David MacIver would call “bad functional code“.

Perhaps even worse than the readability (or lack thereof) is the inefficiency of this code.  It’s terribly slow for just about any sort of operation.  Granted, we can imagine this only being used with a list of buttons of limited length, but it’s the principle of the thing.  The fact is that we are relying on a number of operations which are extremely inefficient with lists, most prominently length and apply() (accessing an arbitrary index).  Not only that, but we’re recreating the entire list every time we change the status of a single button, something which is bad for any number of reasons.

What we really need here is a random-access structure, something which allows us to access and “change” any index with some degree of efficiency.  Likely the most intuitive thing to do here would be to just use a good ol’ array of Boolean(s) and make a new copy of this array any time we need to change something.  Unfortunately, this is almost as bad as our copying the list every time.  Normally, when you use an immutable data structure, modifications do not require copying large amounts of data.  Our toSet example from above uses almost zero data copying under the surface, due to the way that Set and List are implemented.

Specifically, Set and List are persistent data structures.  This doesn’t mean that they live on a filesystem.  Rather, the term “persistent” refers to the fact that each instance of the collection may share significant structure with another instance.  For example, prepending an element onto an immutable list yields a new list which consists of the new element and a tail which is precisely the original list.  Thus, each list contains its predecessor (if you will) within itself.  List is an example of a fully persistent data structure; not everything can be so efficient.  Set and Map for example are usually implemented as some sort of tree structure, and so insertions require some data copying (specifically, the parent nodes).  However, this copying is minimized by the nature of the structure.  This notion of persistence in the data structure works precisely because these structures are immutable.  If you could change an element in a persistent data structure it would likely result in the modification of that same element in totally disparate instances of that structure across the entire runtime (not a pleasant outcome).

So List is persistent, arrays are not.  Even if we treat arrays as being completely immutable, the overhead of copying a potentially huge array on each write operation is rather daunting.  What we need is some sort of data structure with the properties of an array (random access, arbitrary modification) with the persistent properties of a list.  As I said before, this turns out to be a very difficult problem.

Partitioned Tries

One solution to this problem which provides a compromise between these two worlds is that of a partitioned trie (pronounced “try” by all that is sane and holy).  In essence, a partitioned trie is a tree of vectors with a very high branching factor (the number of children per node).  Each vector is itself a tree of vectors and so on.  Note that these are not not like the binary search trees that every had to create back in college, partitioned tries can potentially have dozens of branches per node.  As it turns out, it is this unusual property which makes these structures so efficient.

To get a handle on this concept, let’s imagine that we have a trie with a branching factor of 3 (much smaller than it should be, but it’s easier to draw).  Into this vector we will insert the following data:

 Index 
Data
0
Daniel
7
Chris
2
Joseph
4
Renee
1
Bethany
3
Grace
9
Karen
13
Larry
5
Moya

After all the jumbling necessary to make this structure work, the result will look like the following:

image

It’s hard to see where the “trie” part of this comes into play, so bear with me.  The important thing to notice here is the access times for indexes 0-2: it’s O(1).  This is of course a tree, so not all nodes will be one step away from the root, but the point is that we have achieved constant-time access for at least some of the nodes.  Mathematically speaking, the worst-case access time for any index n is O( log3(n) ).  Not too horrible, but we can do better.

First though, we have to lay some groundwork.  I said that the structures we were working with are tries rather than normal trees.  A trie implies that the key for each element is encoded in the path from the root to that node.  So far, it just appears that we have built a fancy tree with numeric keys and a higher branching factor than “normal”.  This would be true if all we were given is the above diagram, but consider the slightly modified version below:

image

Structurally, nothing has changed, but most of the edges have been renumbered.  It is now a bit more apparent that each node has three branches numbered from 0 to 2.  Also, with a little more thought, we can put the rest of the pieces together.

Consider the “Moya” node.  In our input table, this bit of data has an index of 5.  To find its “index” in our renumbered trie, we follow the edges from the root down to the destination node, arriving at a value of 12.  However, remember that each node has only 3 branches.  Intuitively, we should be thinking about base-3 math somewhere about now.  And indeed, converting 12 into base-3 yields a final value of 5, indicating that the index of the node is indeed encoded in the path from the root based on column.  By the way, this works on any node (try it yourself).  The path to “Karen” is 100, which converted into base-3 becomes 9, the input index of the element.

This is all fine and dandy, but we haven’t really solved our original problem yet: how to achieve constant-time access to arbitrary indexes in a persistent structure.  To really approach a solution to our problem, we must increase the branching factor substantially.  Rather than working with a branching factor of 3 (and thus, O( log3(n) ) efficiency), let’s dial the branching factor up to 32 and see what happens.

The result is completely undiagramable; but it does actually provide constant time access for indexes between 0 and 31.  If we were to take our example data set and input it into our revised trie, the result would be a single layer of nodes, numbered at exactly their logical index value.  In the worst case, the efficiency of our more complicated trie is O( log32(n) ).  Generally speaking, we can infer (correctly) that for any branching factor b and any index n, the lookup efficiency will be precisely logb(n).  As we increase the branching factor, the read-efficiency of our trie increases exponentially.  To put a branching factor of 32 into perspective, this means that the algorithmic complexity of accessing index 1,000,000 would only be 3.986!  That’s incredibly small, especially given the sheer magnitude of the index in question.  It’s not technically constant time, but it’s so incredibly small for all conceivable indexes that we can just pretend that it is.  As Rich Hickey says:

…when it’s a small enough value, I really don’t care about it.

So that takes care of the reading end of life, but what about writing?  After all, if all we needed was constant time lookups, we could just use an array.  What we really need to take care to do is ensure that modifications are also as fast as we can make them, and that’s where the tricky part comes in.

We can think of an array as a partitioned trie with a branching factor of infinity.  When we modify an array immutably, we must copy every element from the original array into a new one which will contain the modification.  This contrasts with List - effectively, a partitioned trie with a branching factor of 1 - which in the best case (prepending) requires none of the elements to be copied.  Our 32-trie is obviously somewhere in between.  As I said previously, the partitioned trie doesn’t really solve the problem of copying, it just compromises on efficiency somewhat (the difference between 1 and 3.986).

The truth is that to modify a partitioned trie, every node in the target sub-trie must be copied into a new subtrie, which then forces the copying of its level and so-on recursively until the root is reached.  Note that the contents of the nodes are not being copied, just the nodes themselves (a shallow copy).  Thus, if we return to our example 3-trie from above and attempt to insert a value at index 12, we will have to copy the “Larry” node along with our new node to form the children of a copy of the “Renee” node.  Once this is done, the “Grace” and “Moya” nodes must also be copied along with the new “Renee” to form the children of a new “Bethany”.  Finally, the “Daniel” and “Joseph” nodes are copied along with the new “Bethany” to form the children of a new root, which is returned as the modified trie.  That sounds like a lot of copying, but consider how much went untouched.  We never copied the “Karen” or the “Chris” nodes, they just came over with their parent’s copies.  Instead of copying 100% of the nodes (as we would have had it been an array), we have only copied 80%.  Considering that this was an example contrived to force the maximum copying possible, that’s pretty good!

Actually, we can do even better than this by storing the children of each node within an array (we would have to do this anyway for constant-time access).  Thus, only the array and the modified nodes need be copied, the other nodes can remain untouched.  Using this strategy, we further reduce the copying from 80% to 30%.  Suddenly, the advantages of this approach are becoming apparent.

Now of course, the higher the branching factor, the larger the arrays in question and so the less efficient the inserts.  However, insertion is always going to be more efficient than straight-up arrays so long as the inserted index is greater than the branching factor.  Considering that most vectors have more than 32 elements, I think that’s a pretty safe bet.

Implementation

I bet you thought I was going to get to this section first.  Foolish reader…

Once we have all this theoretical ground-work, the implementation just falls into place.  We start out with a Vector class parameterized covariantly on its element type.  Covariant type parameterization just means that a vector with type Vector[B] is a subtype of Vector[A] whenever B is a subtype of AList works this way as well, as do most immutable collections, but as it turns out, this sort of parameterization is unsafe (meaning it could lead to a breakdown in the type system) when used with mutable collections.  This is part of why generics in Java are strictly invariant.

Coming back down to Earth (sort of), we consider for our design that the Vector class will represent the partitioned trie.  Since each child node in the trie is a trie unto itself, it is only logical to have each of the nodes also represented by Vector.  Thus, a Vector must have three things:

  • data
  • length
  • branches

Put into code, this looks like the following:

class Vector[+T] private (private val data: Option[T], 
      val length: Int, private val branches: Seq[Vector[T]]) extends RandomAccessSeq[T] {
  def this() = this(None, 0, new Array[Vector[T]](BRANCHING_FACTOR))
 
  // ...
}

RandomAccessSeq is a parent class in the Scala collections API which allows our vector to be treated just like any other collection in the library.  You’ll notice that we’re hiding the default constructor and providing a no-args public constructor which instantiates the default.  This only makes sense as all of those fields are implementation-specific and should not be exposed in the public API.  It’s also worth noting that the branches field is typed as Seq[Vector[T]] rather than Array[Vector[T]].  This is a bit of a type-system hack to get around the fact that Array is parameterized invariantly (as a mutable sequence) whereas Seq is not.

With this initial design decision, the rest of the implementation follows quite naturally.  The only trick is the ability to convert an index given in base-10 to the relevant base-32 (where 32 is the branching factor) values to be handled at each level.  After far more pen-and-paper experimentation than I would like to admit, I finally arrived at the following solution to this problem:

def computePath(total: Int, base: Int): List[Int] = {
  if (total < 0) {
    throw new IndexOutOfBoundsException(total.toString)
  } else if (total < base) List(total) else {
    var num = total
    var digits = 0
    while (num >= base) {
      num /= base
      digits += 1
    }
 
    val rem = total % (Math.pow(base, digits)).toInt
 
    val subPath = computePath(rem, base)
    num :: (0 until (digits - subPath.length)).foldRight(subPath) { (i, path) => 0 :: path }
  }
}

As a brief explanation, if our branching factor is 10 and the input index (total) is 20017, the result of this recursive function will be List(2, 0, 0, 1, 7).  The final step in the method (dealing with ranges and folding and such) is required to solve the problem of leading zeroes dropping off of subsequent path values and thus corrupting the final coordinates in the trie.

The final step in our implementation (assuming that we’ve got the rest correct) is to implement some of the utility methods common to all collections.  Just for demonstration, this is the implementation of the map function.  It also happens to be a nice, convenient example of good functional code.  :-)

override def map[A](f: (T)=>A): Vector[A] = {
  val newBranches = branches map { vec =>
    if (vec == null) null else vec map f
  }
 
  new Vector(data map f, length, newBranches)
}

Properties

Before moving on from this section, it’s worth noting that that our implementation of the vector concept has some rather bizarre properties not held by conventional, mutable vectors.  For one thing, it has a logically infinite size.  What I mean by this is it is possible to address any positive integral index within the vector without being concerned about resize penalties.  In fact, the only addresses which throw an IndexOutOfBoundsException are negative.  The length of the vector is defined to be the maximum index which contains a valid element.  This actually mirrors the semantics of Ruby’s Array class, which also allows placement at any positive index.  Interestingly enough, the efficiency of addressing arbitrary indexes is actually worst-case much better in the persistent trie than it is in a conventional amortized array-based vector.

Vectored Buttons

Since we now have an immutable data structure with efficient random-access, we may as well rewrite our previous example of the button strip using this structure.  Not only is the result far more efficient, but it is also intensely cleaner and easier to read:

class ButtonStrip private (buttons: Vector[Boolean]) {
  def this(num: Int) = this(new Vector[Boolean])     // no worries about length
 
  def status(index: Int) = buttons(index)
 
  def push(index: Int) = new ButtonStrip(buttons(index) = true)
 
  def unpush(index: Int) = new ButtonStrip(buttons(index) = false)
}

You’ll notice that the update method is in fact defined for Vector, but rather than returning Unit it returns the modified Vector.  Interestingly enough, we don’t need to worry about length allocation or anything bizarre like that due to the properties of the persistent vector (infinite length).  Just like arrays, a Vector is pre-populated with the default values for its type.  In the case of most types, this is null.  However, for Int, the default value is 0; for Boolean, the default is false.  We exploit this property when we simply return the value of dereferencing the vector based on any index.  Thus, our ButtonStrip class actually manages a strip of infinite length.

Conclusion

The truth is that we didn’t even go as far as we could have in terms of optimization.  Clojure has an implementation of a bit-partitioned hash trie which is basically the same thing (conceptually) as the persistent vector implemented in this article.  However, there are some important differences. 

Rather than jumping through strange mathematical hoops and hacky iteration to develop a “path” to the final node placement, Clojure’s partitioned trie uses bit masking to simply choose the trailing five bits of the index.  If this node is taken, then the index is right-shifted and the next five bits are masked off.  This is far more efficient in path calculation, but it has a number of interesting consequences on the final shape of the trie.  Firstly, the average depth of the tree for random input is less, usually by around 1 level (on average).  This means that the array copying on insert must occur less frequently, but must copy more references at each step.  Literally, the trie is more dense.  This probably leads to superior performance.  Unfortunately, it also requires that the index for each value be stored along with the node, requiring more memory.  Also, this sort of “bucket trie” (to coin a phrase) is a little less efficient in lookups in the average case.  Not significantly so, but the difference is there.  Finally, this masking technique requires that the branching factor be a multiple of 2.  This isn’t such a bad thing, but it does impose a minor restriction on the flexibility of the trie.

Most importantly though, Clojure’s trie uses two arrays to contain the children: one for the head and one for the tail.  This is a phenomenally clever optimization which reduces the amount of array copying by 50% across the board.  Rather than having a single array of length 32, it has two arrays of length 16 maintained logically “end to end”.  The correct array is chosen based on the index and recursively from there on down the line.  Naturally, this is substantially more efficient in terms of runtime and heap cost on write.

At the end of the day though, these differences are all fairly academic.  Just having an idiomatic partitioned trie for Scala is a fairly significant step toward functional programming relevance in the modern industry.  With this structure it is possible to still maintain lightning-fast lookup times and decent insertion times without sacrificing the critical benefits of immutable data structures.  Hopefully structures like this one (hopefully, even better implementations) will eventually find their way into the standard Scala library so that all may benefit from their stream-lined efficiency.

Update

The implementation of Vector which I present in this article is inherently much less efficient than the one Rich Hickey created for Clojure.  I finally broke down and created a line-for-line port from Clojure’s Java implementation of PersistentVector into a Scala class.  I strongly suggest that you use this (much faster) implementation, rather than my own flawed efforts.  :-)  You can download the improved Vector here: Vector.scala.