Skip to content

Bloom Filters in Scala (and all the fun that they bring)

13
Oct
2008

This is a post I have been meaning to write for quite some time.  Despite being an officially excommunicated former-member of the Ruby community, I still like to keep up on the latest developments.  A few months ago, my Ruby Inside feed produced an interesting post introducing a new library by Peter Cooper (the author of Ruby Inside).  Unlike the most “new” Ruby libraries which are most often just bad clones of Rails snibblets, Peter’s WhatLanguage seemed to be quite unique and interesting.  For the first time in years, I was reading about an innovation in Ruby-land which actually sounded quite exciting.  A good feeling to be sure.

In a nutshell, WhatLanguage (project page at GitHub) is a statistical language identification API which analyses arbitrary input String(s), scoring them against a set of built-in languages (including English, French, Spanish, Russian, and others).  Add to that a little bit of meta-programming magic and the possibilities are eye-catching to say the least:

require 'whatlanguage'
 
'Here is some text to identify'.language       # => :english
'¿Hola, pero donde está su pelo?'.language     # => :spanish

To be honest, I’m not sure what practical application I would have for such a library, but it sure looks like fun!  :-)  Under the surface, the implementation is deceptively simple.  WhatLanguage bundles a set of words for each language it supports.  The implementation then searches these sets for each word in the string, aggregating a score for each language depending on whether or not the word was found.  If exactly one language has a higher score than the others once the end of the string is reached, it is “declared the winner” and returned as a result.  Otherwise, the analysis is undecided and nil is returned.  Simple, right?

The tricky part is somehow storing and searching those gigantic language word sets in an efficient manner.  As it turns out, this isn’t so easy.  Just stop and consider the number of words which exist in your native tongue.  Unless you speak a very strange, primarily-tonal language, the number is likely in the upper hundreds of thousands.  Just going off of Peter’s word lists, Russian is the real trouble maker, weighing in with a nearly one million word vocabulary.  Any attempt to store that much data would require megabytes of memory and several minutes to process, even on a fast machine.  Any sort of hash set is completely out of the question due to the nature of contiguous memory allocation, but can you imagine trying to perform a search traversal on a tree with almost a million nodes?  Obviously, a more effective solution is required.

Probabilistic Sets

So far we have been focusing our imaginary efforts on data structures which guarantee 100% accuracy and full reversibility.  That is to say, data structures which allow you to get out what you put in and which ensure that if something is “in” it will be found and if something is “out” it will not.  After all, is a data structure without these guarantees really all that useful?

As it turns out, the answer is “yes”.  Consider a standard hash set as an example.  Sets make absolutely no guarantee about the ordering of their elements.  If you attempt to iterate over a set, you will eventually reach everything inside, but you cannot rely on a certain element coming before the others.  Of course, the advantage we gain by making this trade-off is constant time searching by item identity.  Lists cannot achieve this performance, neither can arrays.  If you want to search either one of these structures, you must look through every element from start to finish.  So with hash sets, we have effectively traded a fundamental guarantee about our data for efficiency in some other respect.

The same is true for probabilistic sets.  We don’t actually need all of the properties of a set to implement WhatLanguage.  For example, we don’t need to actually retrieve a word from our dictionary, we only need to discover whether or not that word is in the set.  In fact, we really only need our data structure to have two important properties:

  • Lightning fast boolean searches (does the set contain this element?)
  • Minimal memory footprint

Hash sets can make the first guarantee, but as we have already noticed, they don’t really require any less memory than just storing everything in an array.  Since we can’t afford one million words-worth of memory per language, we need to come up with some way of storing information about the presence of an element without retaining the data it represents.  In order to do this, we have to make one important sacrifice: accuracy.

We basically need to accept the fact that our data structure may sometimes return inaccurate results.  As it turns out though, this is just fine.  Remember, we’re scoring languages one token at a time; an incorrect answer now and again isn’t going to kill us.  Given a large enough input string, a minimal percentage of inaccuracy will literally be lost in the noise.

As it turns out, there already exists a very old technique for accomplishing exactly these requirements.  Bloom filters have been around since the beginning of time (or at least the 1980s), but unfortunately very few developers are aware of them.  In a nutshell, they precisely satisfy all of the needs we just finished enumerating.  They provide true (not amortized) constant-time searches and will consistently return true or false for a given element, depending on whether or not the value is in the set.  Occasionally they will give false positives (claiming that an element is in the set when in fact it is not), but never false negatives.  This property is extremely useful in many applications, but it turns out that this is even more than we need for WhatLanguage.

Most importantly though, bloom filters are extremely compact in memory.  Rather than storing an entire value, they hash every element to one or (usually) more indexes in an array of bits (0 or 1).  In an optimal bloom filter guaranteeing no false positives, each element would correspond to precisely k bits of memory (where k is the number of hash functions, usually in the vicinity of 3 or 4).  Once this hashing is complete, the value itself may be discarded.  Thus, assuming an optimal bloom filter with 3 hash functions, the storage requirements for one million words would only be 375 KB of memory.  Compare this to the 8 MB required just to store the pointers to each of the words, much less the values themselves!  If we include the underlying character values, we start pushing closer to 20 MB.

375 KB for a million elements is impressive, but bloom filters can actually do a lot better than that.  Remember, we agreed to accept some false positives.  It’s easy to imagine how even an ideal bloom filter (such as the one we have been analyzing) could have some inaccuracy, but if we’re willing to live with a few more false positives, we can allow hash collisions and drop our memory requirements into the 200 KB range.  Not bad, considering how much data we’re storing!

Bloom Filters

So how does the magic actually work?  We’ve hinted at the implementation in our analysis of memory requirements, but we’re still missing some important details.  Consider the following diagram (courtesy of Wikipedia):

Under the surface, a bloom filter is just an array of boolean values.  The length of the array is referred to as the width of the bloom filter.  When the filter is empty, all of the array values are set to false.  To insert an element, we compute k different hash indexes in the array and set them all to true.  The real beauty of the data structure is the manner in which we handle collisions: we don’t worry about it.  Whenever one (or more) of the hash functions returns an index which is already true, we simply leave that index as-is and move on.  Thus, the procedure for inserting an element into a “full” bloom filter is precisely the same as inserting into an empty one: set all of the hash indexes to true.  In a sense, bloom filters can never be full, they just get less and less accurate.

Bloom filters define only one operation other than insertion: searching.  To determine whether or not a value is in the set, we run all of our hash functions and find the indexes to which it corresponds.  If all of these indexes are true, then we can assume that the value is indeed in the set.  Conversely, if any one of indexes is false, then the value cannot be in the set since we would have set that value to true upon insertion.

It is important to note that removal is not supported by bloom filters.  Intuitively we could just take the value to be removed, calculate its hash indexes and then set the corresponding elements to false.  However, if any other values also hash to even one of those locations, we will be effectively removing it from the bloom filter “by accident”.  Subsequent searches for that value will hash to the index which we erroneously set to false, resulting in a false negative on the search (saying that an element is not in the set when in fact it is).  Fortunately for us, WhatLanguage only requires insertion and searching for values, removal is not on the agenda.

Scala Implementation

Well, this has probably been a nostalgic (for those of you who enjoyed your university days) stroll through the land of data structure theory, but we’re still lacking one critical factor: a concrete implementation.  Bloom filters don’t sound that complicated on paper, but to do anything useful with them we’re going to need some code.

Before we dive in, we might want to define some more precise requirements for our implementation.  We’re writing this in Scala, so it would be nice to implement bloom filters as a functional data structure.  I’m not going to get into all the benefits of immutable data structures; suffice it to say, I really think that they’re the only way to go for most serious applications.  Unfortunately, in order to do this, we’re going to need some sort of immutable, array-like data structure under the surface which provides random access in constant time.  Sound familiar?

This is actually where the whole saga of persistent vectors in Scala found its roots.  The first time I tried to implement a functional bloom filter in Scala I made use of a conventional array under the surface with copy-on-write semantics.  As you can imagine, this is hideously inefficient in both space and time.  In order to solve the problem, I had to implement a data structure which was just as efficient for reading and writing, but which also shared structure with subsequent versions.  Now, two months later, I’ve borrowed just such a design from Clojure and created a first-class implementation in Scala.

With a persistent vector now at our disposal, we can easily implement a persistent bloom filter by using a vector to represent our boolean indexes.  Every time we perform an insertion, we will “modify” the relevant indexes, creating a new instance of the vector.  This vector can then be used in our “modified” bloom filter, returned as a new instance of our container class.  Since persistent vectors share unmodified structure between versions, so does our bloom filter.  All of the hard work of efficient immutability is already done for us, we just need to implement a thin facade.  We start out by defining the class skeleton:

import BloomSet._
 
class BloomSet[A] private (val size: Int, val k: Int, 
                           private val contents: Vector[Boolean]) extends ((A)=>Boolean) {
  val width = contents.length
 
  def this(width: Int, k: Int) = this(0, k, alloc(width))
}
 
object BloomSet {
  def alloc(size: Int) = {
    (0 until size).foldLeft(Vector[Boolean]()) { (c, i) => c + false }
  }
}

We’ll keep all of our utility functions in the companion object.  There’s really no need to put them in the main class, and the JVM is able to optimize dispatch to static members to a slightly higher degree than it can instance members (invokestatic).  Technically, we don’t really need the alloc function, it just hides some of the boiler-plate involved in initializing a vector to a certain length.  Note that this operation is extremely fast, even for high values of width.  Despite the fact that we are creating a huge number of Vector objects, very little memory is actually wasted.  Even at this early stage in the implementation, the advantages of persistence are evident.

A Little Math

You’ll notice that our bloom filter has three major properties: size, k, and width.  As previously mentioned, k is the number of hash functions to be used, while width is the length of the array used to store the boolean values.  These values have to be user-configurable because different applications will call for different optimizations.  For example, implementing WhatLanguage requires the storage of almost a million elements per set.  Obviously we would not want to attempt to use a bloom filter which only had a width of (say) 200.  This wouldn’t cause any errors per se (remember, bloom filters are never “full”), but the accuracy of such a filter would be almost nil.  With a million elements all hashed into a mere 200 indexes, it is likely that that every single index would have a value of true.  Such a bloom filter would return true when searching for any value, including those which were not inserted.  We said we would tolerate some false positives, but we still want the results to be meaningful.  Since the width of a bloom filter cannot be changed once data is inserted (at least, not without “losing” data), the user must be able to decide in advance how much space will be required and then instantiate BloomSet accordingly.

The value of k is an even more interesting proposition.  The intuition is that the larger our value of k (thus, the more hash functions we use), the greater the accuracy of our bloom filter and the fewer false positives we will return.  This is technically true when the bloom filter does not contain very many elements, but as more and more values are inserted, higher values of k become problematic.  Wikipedia has an excellent (and extremely readable) mathematical analysis of this and other properties of bloom filters.  As it turns out, the optimal value for k (one which will yield the fewest false positives) can be computed by the following expression (where m is the width and n is the size, or the number of elements contained in the filter):

k = \mbox{ln}(2) \; \frac{m}{n}

For the sake of quick-computation, this is essentially equivalent to:

k \approx \frac{9m}{13n}

The equivalence isn’t precise (logarithms are almost always irrational), but since we’re already rounding the result to the nearest integer, we don’t need to be too concerned.

While it would be nice if we could just auto-magically set k to the optimum value and be done with it, that just isn’t the case.  Like width, k cannot be changed after data has been inserted.  Unfortunately, its optimum value is dependent on the number of items which will eventually be in the set.  Since we don’t really “know” this value within the data structure until after the insertions are complete, there is no way for us to self-optimize.  To provide maximum configurability, we need to allow this value to be set by the API consumer on a case-by-case basis.

As an interesting aside, we can actually compute the accuracy of a bloom filter on a scale from 0 to 1 given values for width, k and size.  In this case, accuracy is defined as the odds that a search result is indeed accurate (the chance of it not being a false positive).  Wikipedia gives the following expression for this accuracy:

1 - (1 - e^{-kn / m})^k

Thanks to the marvels of approximation, this is a fairly efficient computation on modern platforms.  Translated into Scala, it looks something like this:

lazy val accuracy = {
  val exp = ((k:Double) * size) / width
  val probability = Math.pow(1 - Math.exp(-exp), k)
 
  1d - probability
}

Core Functionality

As interesting as this minor detour has been, we still haven’t arrived at a working implementation of the bloom filter data structure.  We have the basic structure in place, but we still need to implement the two core operations: insert and search, rendered in Scala as + and contains:

def +(e: A) = new BloomSet[A](size + 1, k, add(contents)(e))
 
def contains(e: A) = {
  (0 until k).foldLeft(true) { (acc, i) => 
    acc && contents(hash(e, i, width)) 
  }
}
 
protected def add(contents: Vector[Boolean])(e: Any) = {
  var back = contents
 
  for (i <- 0 until k) {
    back = back(hash(e, i, width)) = true
  }
 
  back
}

Feels sort of anti-climactic doesn’t it?  After all that analysis, we find that it is possible to implement the essence of a bloom filter in less than 17 lines of code.  Technically, I probably could trim this down to more like 8 lines, but this isn’t an exercise in Scala golf.  These three methods are just the materialization of all of the hand-wavy descriptions that we’ve spent the last five pages working through.  Each time we insert an element, we calculate k different hash values and set the appropriate vector indexes to true.  To find that element, we calculate the same k hashes and ensure that all of the indexes have an appropriate boolean value.

Hashing

Those of you who are actually reading the code snippets rather than just blindly trusting my veracity will probably notice that I’ve omitted a fairly important function from the above sample; specifically: hash.  Somehow, we need to define a function which will simulate the existence of a potentially unlimited number of hash functions.  Based on our usage of this function, we need to be able to grab a hash value for a given value and function index, restricted to a given width.  This is not a trivial problem.

As a good developer, you should already be trying to think up ways to reduce the code involved in implementing these requirements.  I mean, we could potentially just dream up a few dozen different hash functions with a fairly wide distribution of results.  However, I don’t know about you, but I really don’t have that much imagination.  I have a hard enough time developing just one solid hash function, much less dozens.  What’s more, if each hash function corresponds with a different method, then we can’t really scale the number of hash functions to any integer k.

A better approach would be to create a single hash function and then skew its value n times for every n between 0 and k.  Like most things that have to do with hashing, it’s cheating, but providing we do our job right, it should give us a fairly decent way of computing the required k hash functions.  Of course, the key phrase there is “do our job right”.  We need to make sure that we skew the initial hash value sufficiently for each successive hash, otherwise successive hash functions might get “stuck in a groove”.  For example, let’s assume that each successive hash function is equivalent to the previous hash + 2.  If we then inserted two elements into the bloom filter with k = 3, there would be the potential that the hash(0) for the first element would be precisely 2 greater than the hash(0) for the second.  If this is the case, then hash(1) for the second would equal hash(0) for the first.  Thus, the first and second elements would collide on two out of three indexes.  Not exactly an optimal distribution.

To avoid this problem, we should be sure to skew the hash value by a different amount for each hash function.  A convenient way of doing this is to use the number of iterations, or the index of the particular hash function.  To further improve the distribution, we will also use the bitwise XOR operation rather than simple addition.  It’s not really intuitively obvious, but XOR is mathematically far superior for hashing and just as cheap in terms of processor cycles.  Put it all together, and the implementation looks something like this:

def hash(e: Any, iters: Int, bounds: Int): Int = {
  Math.abs(
    if (iters == 0) e.hashCode 
    else iters ^ hash(e, iters - 1, bounds)
    ) % bounds
}

The final bit of math in this function mods our skewed hash value with the given bounds.  Whenever we call hash, we always pass the width of the bloom filter.  This is a trick that you’ll see a lot in any hash-based data structure.  Essentially, we’re just mapping our hash value into the required domain.  Hash values can be anything between -231 and 231 - 1, but our vector indexes can only be between 0 and width - 1.  Just a rule of thumb, but any time you have a problem where you must “map values from a large domain to a more restricted one”, mod is generally the correct approach.

Observant readers will notice that our implementation of hash is a little inefficient.  Actually, it’s worse than just a “little” inefficient: it’s downright terrible.  Remember that we have to call this method once for every integer n between 0 and k.  This isn’t so bad, since k is usually comparatively small, but the real problem is our sneaky use of recursion within hash.  The time complexity of hash is precisely O(n), where n is the number of iterations.  Once you factor this in with the number of times we call this function and it is easy to see how the total complexity of computing k hash values is actually O(k2).  Thus, for k = 4, we actually call hash 16 times!  You can imagine how this could get very problematic for higher values of k.

Technically, we could rewrite add and contains to compute the hash values incrementally from the bottom up.  This would bring the hashing complexity down to O(k).  However, this is a bit more elegant in terms of code, not to mention easier to talk about in a modular fashion.  In practice, this inefficiency is only a problem for higher values of k due to the fact that computing a single-step hash value is actually a very fast operation.  We will run into trouble in a little bit though when we attempt to optimize the bloom sets for shorter wordlists in WhatLanguage.

Serialization

One of the important requirements of WhatLanguage that we haven’t really touched on so far is the ability to store freeze-dried bloom filters on disk and then suck them back into memory quickly and efficiently.  If you think about it, this requirement only makes sense since one of our primary motivations from the start was to avoid the burdens associated with huge word lists.  After all, if our in-memory representation of a language is only a few hundred kilobytes, shouldn’t the on-disk representation be no worse?  Furthermore, if we had to parse a wordlist into a bloom filter every time we wanted to check a string’s language, the performance imposition would be utterly prohibitive.

Fortunately, there is a fairly obvious format which we could use to efficiently serialize an arbitrary bloom filter.  Recall that the real heart of a bloom filter is the array of boolean flags, an array which can also be represented as an array of bits.  We could very easily select these bits eight at a time, combine them together using bit-shifting and produce a byte value which could be sent to an OutputStream.  Thus, a bloom filter with an array of {true, false, true, true, true, false, false, true, false} would be stored as the following series of bits: 101110010, which in hexadecimal byte form becomes B9 00.

In addition to the array, we also need to store the size, k, and width of the bloom filter, otherwise it will be impossible to find anything reliably in the deserialized result.  The Scala store and load methods to accomplish this are verbose, but fairly straightforward:

class BloomSet[A] ... {
  ...
 
  def store(os: OutputStream) {
    os.write(convertToBytes(size))
    os.write(convertToBytes(k))
    os.write(convertToBytes(contents.length))
 
    var num = 0
    var card = 0
    for (b <- contents) {
      num = (num << 1) | (if (b) 1 else 0)    // construct mask
      card += 1
 
      if (card == 8) {
        os.write(num)
 
        num = 0
        card = 0
      }
    }
 
    if (card != 0) {
      os.write(num)
    }
  }
}
 
object BloomSet {
  def load[A](is: InputStream) = {
    val buf = new Array[Byte](4)
 
    is.read(buf)
    val size = convertToInt(buf)
 
    is.read(buf)
    val k = convertToInt(buf)
 
    is.read(buf)
    val width = convertToInt(buf)
 
    var contents = Vector[Boolean]()
    for (_ <- 0 until (width / 8)) {
      var num = is.read()
      var buf: List[Boolean] = Nil
 
      for (_ <- 0 until 8) {
        buf = ((num & 1) == 1) :: buf
        num >>= 1
      }
 
      contents = contents ++ buf
    }
 
    if (width % 8 != 0) {
      var buf: List[Boolean] = Nil
      var num = is.read()
 
      for (_ <- 0 until (width % 8)) {
        buf = ((num & 1) == 1) :: buf
        num >>= 1
      }
 
      contents = contents ++ buf
    }
 
    new BloomSet[A](size, k, contents)
  }
 
  ...
}

It is interesting to note that store is a good example of what David MacIver meant when he says that functional code image good code.  I actually tried to write this in a functional style to start with, but I gave up after it became horribly ugly (having four nested folds is never a good sign).  The imperative rendering (in this case) is concise and elegant, one of the many advantages of a hybrid languages like Scala.

This particular format for rendering a bloom set is almost precisely identical to the format used by Peter Cooper’s Bloomin’ Simple, the library which sits underneath WhatLanguage.  However, due to hashing differences it is impossible to use his serialized bloom filters in Scala or vice versa.

WhatLanguage in Scala

So now that we have our magic bloom filter ready and waiting, it’s time to take a crack at Peter’s fancy language identifier!  As it turns out, this is a lot simpler than the bloom filter itself.  Conceptually, all we have to do is load a different bloom filter for each language and then use that bloom filter to check each input string token individually.  For the sake of uniformity, we will convert all tokens to lower-case.

Of course, before we can load the language bloom filters we need to first store them, and we cannot store a bloom filter before it is generated.  To that end, we need to create a simple Scala script (yes, a script) which reads in a file of newline-delimited words, inserts them all into an instance of BloomSet and then stores the result in a corresponding language file.  Scala’s scripting functionality is actually quite good, especially for a statically typed language, and really deserves some better recognition.  The complete language generator script is as follows:

import com.codecommit.collection.BloomSet
import java.io.{BufferedOutputStream, File, FileOutputStream}
import scala.io.Source
 
val WIDTH = 2000000
 
def computeK(lines: Int) = (((9:Double) * WIDTH) / ((13:Double) * lines)).intValue
 
for (file <- new File("wordlists").listFiles) {
  if (!file.isDirectory) {
    println(file.getName)
    println("==========================")
 
    val src = Source.fromFile(file)
    val count = src.getLines.foldLeft(0) { (i, line) => i + 1 }
    println("  Word count: " + count)
 
    val optimalK = computeK(count)
    val init = new BloomSet[String](WIDTH, Math.min(optimalK, 100))
 
    println("  Optimal K: " + optimalK)
    println("  Actual K: " + init.k)
 
    val set = src.reset.getLines.foldLeft(init) { _ + _.trim }
 
    println("  Accuracy: " + set.accuracy)
 
    val os = new BufferedOutputStream(
        new FileOutputStream("src/main/resources/lang/" + file.getName))
    set.store(os)
    os.close()
 
    println()
  }
}

I originally created this script using JRuby, thinking of course that it would be much easier to perform a simple, one-off task like this in a dynamically typed language.  Interestingly enough, the JRuby version was about twice as long and actually took dramatically longer to run.  By “dramatically” I mean on the order of seventy times longer.  The above Scala script takes just over 30 seconds to run on my machine using the wordlists from WhatLanguage.  This stands in stark contrast to the 35 minutes required for the JRuby script.  Both scripts use the same underlying data structure to perform all the work (BloomSet), so it’s really hard to claim that one implementation was fundamentally slower than the other.  In short: JRuby does not seem terribly well-suited for computationally-intensive tasks, even when most of that computation is taking place in Java-land.  Anyway…

This script is fairly straightforward.  The one interesting landmark is the fact that it actually makes two separate passes over the wordlist file.  The first pass just counts the number of lines, while the second pass reads the words individually, converts them to lower-case and stores them in the bloom filter.  This two-pass approach allows us to calculate the optimal value for k in our bloom filter.  This contrasts with the Ruby implementation of WhatLanguage, which just uses 3 as a global default (the width is the same in both libraries).  According to my math, this per-language optimization actually ekes out an average 2-3% better accuracy in the bloom filters.  This means fewer false positives and nominally better language detection.

I say “nominally” because such a miniscule improvement in bloom filter performance actually makes very little difference in the grand scheme of things.  Most strings are decidedly one language or another, meaning that results should be essentially identical between the two implementations.  Regardless, it’s the thought that counts.  (right?)

Just for the curious, the script produces the following output when run:

  dutch
  ==========================
    Word count: 222908
    Optimal K: 6
    Actual K: 6
    Accuracy: 0.986554232499401

  english
  ==========================
    Word count: 234936
    Optimal K: 5
    Actual K: 5
    Accuracy: 0.9827068384240777

  farsi
  ==========================
    Word count: 339747
    Optimal K: 4
    Actual K: 4
    Accuracy: 0.9408664843803285

  french
  ==========================
    Word count: 629569
    Optimal K: 2
    Actual K: 2
    Accuracy: 0.7817441534106517

  german
  ==========================
    Word count: 298729
    Optimal K: 4
    Actual K: 4
    Accuracy: 0.9590696917792074

  pinyin
  ==========================
    Word count: 399
    Optimal K: 3470
    Actual K: 100
    Accuracy: 1.0

  portuguese
  ==========================
    Word count: 386393
    Optimal K: 3
    Actual K: 3
    Accuracy: 0.9148904670664727

  russian
  ==========================
    Word count: 951830
    Optimal K: 1
    Actual K: 1
    Accuracy: 0.6213162918878307

  spanish
  ==========================
    Word count: 595946
    Optimal K: 2
    Actual K: 2
    Accuracy: 0.7984358472107859

  swedish
  ==========================
    Word count: 54818
    Optimal K: 25
    Actual K: 25
    Accuracy: 0.9999999755910407

I’m not sure why Pinyin only has a vocabulary of 399 words, but that seems to be the way things are.  This means of course that we’re storing barely 400 elements in a bloom filter with a width of 2,000,000.  Needless to say, I wasn’t surprised to see that the optimal k was in the mid thousands.  Unfortunately, this is where our inefficient hash implementation comes back to bite us.  I discovered that if I were to allow a k value of 3470, even when only inserting 399 elements, the time required to process just the single language was upwards of 20 minutes.  This may have something to do with the fact that 34702 is very, very large.

To get around this problem, I cheat and cap the k value at 100 in the script.  This still produces a computed accuracy of 100% in the bloom filter, and it takes much less time to process.  A fast hash is actually even more important during lookups.  Considering that we have to check each token in a string against each and every supported language, a fast lookup is crucial to ensuring performance of the final product.  If it took even 10 ms to compute the hash of a single token when checking against the Pinyin language, the implementation would be completely unusable.

A Little API Design

Now that we have our word lists encoded as bloom filters, we can turn our attention to slightly more important problems; specifically: what do we want the API to look like?  I decided to base my Scala implementation primarily on Peter’s Ruby API.  However, there is one small wrinkle in this plan: WhatLanguage uses symbols for everything.  That’s fine and in line with Ruby idioms, and technically we could use Scala symbols if we really wanted to, but Scala is a statically typed language with a lot of powerful idioms of its own.  It would be a lot more conventional if we found a type-safe way of representing the same concepts.  To that end, I decided to go with a similar (but not identical) approach as exemplified in the following code snippet:

import com.codecommit.lang._
 
"Hello, this is a test".language       // => english
 
val wl = new WhatLanguage(english, french, spanish)
val text = "Bonjour, my name is Daniel. Estoy bien. Como estas? Êtes-vous ennuyer?"
 
wl.processText(text)   // => Map(English -> 4, French -> 5, Spanish -> 6)

The neat trick of the day is our use of static values in scope as a form of type-safe symbol.  We create a new instance of WhatLanguage, passing it three instances of class Language representing (oddly enough) the languages we wish it to use during the analysis.  We can also use “all” as a shortcut for enumerating every supported language.

Another cute little API feature is our use of the lang “package” to bring everything into scope, including two implicit conversions and a small boat-load of language values.  This is made possible by the fact that lang is not actually a package but a singleton object.  Even Scala does not allow values and functions as top-level elements in a package, but it does allow them within objects.  Our import at the head of the snippet is actually equivalent to the following bit of Java:

import static com.codecommit.lang.*;

With all of this in mind, the rest of the library is fairly easy to create.  All we need to do is design an algorithm which splits an input string into tokens, loops over each one keeping track of how many tokens are matched by each language, and finally selects the highest-scoring language from the result and returns its corresponding static value.  As one would expect, this is easily accomplished:

package com.codecommit
 
import java.io.BufferedInputStream
import com.codecommit.collection.BloomSet
 
import scala.collection.mutable.ArrayBuffer
 
object lang {
  case object all extends Language("all") {
    val langs = List(dutch, english, farsi, french, german, pinyin, 
                     portuguese, spanish, swedish)
 
    override lazy val words = BloomSet[String]()
 
    override val toString = "All languages meta-variable"
  }
 
  val dutch = Language("dutch")
  val english = Language("english")
  val farsi = Language("farsi")
  val french = Language("french")
  val german = Language("german")
  val pinyin = Language("pinyin")
  val portuguese = Language("portuguese")
  val russian = Language("russian")
  val spanish = Language("spanish")
  val swedish = Language("swedish")
 
  implicit def conversion(str: String) = new {
    val language = new WhatLanguage(all).language(str).getOrElse(null)
  }
 
  implicit def languageToString(lang: Language): String = lang.toString
 
  sealed case class Language private[lang] (file: String) {
    lazy val words = {
      val is = new BufferedInputStream(getClass.getResourceAsStream("/lang/" + file))
 
      try {
        BloomSet.load[String](is)
      } finally is.close()
    }
 
    override val toString = file(0).toUpperCase + file.substring(1)
 
    override def equals(other: Any) = other match {
      case lang: Language => toString == lang.toString
      case _ => false
    }
 
    override val hashCode = file.hashCode
  }
 
  class WhatLanguage(langs: Language*) {
 
    def language(str: String) = {
      val back = new ArrayBuffer[Language]
      var max = 0
 
      for ((lang, score) <- processText(str)) {
        if (score > max) {
          back.clear
 
          back += lang
          max = score
        } else if (score == max) {
          back += lang
        }
      }
 
      if (back.length == 1) Some(back(0)) else None
    }
 
    def processText(str: String) = {
      val langs = if (this.langs.contains(all)) all.langs else this.langs
      val prime = langs.foldLeft(Map[Language, Int]()) { _(_) = 0 }
 
      str.split("""\s+""").foldLeft(prime) { (map, token) =>
        langs.foldLeft(map) { (map, lang) =>
          if (lang.words.contains(token.toLowerCase)) {
            map(lang) = map(lang) + 1 
          } else map
        }
      }
    }
  }
}

Conclusion

Well, it’s been a fun and informative journey through the fertile land of bloom sets and simple statistical language analysis.  As usual, all of the source (and binaries) are available for download.  It’s a bit of a shame, but I didn’t get a chance to discuss a number of rather interesting features also available in my bloom filter implementation (such as concatenation of bloom filters).

Bloom filters are quite interesting in and of themselves, and definitely a useful technique that every developer should keep in their back pocket.  Peter Cooper definitely merits kudos for his implementation of WhatLanguage based on bloom filters, but despite the fact that I took the time to port it to Scala, I still don’t see much practical benefit.  At any rate, the library is now available in Scala; hopefully someone else will find it more useful than I do!

Software Transactional Memory in Scala

6
Oct
2008

The fact is that there are a lot of problems that are hard to solve in a purely-functional style.  That’s not to say that no solution exists, but certain problems are very difficult to model without shared state.  In such situations, a slightly different approach to concurrency must be considered.  Actors are inapplicable, seeing as they are the embodiment of “shared-nothing” continuation passing, and fork/join doesn’t really help us.  Usually, when faced with the need for shared mutable state, most developers will resort to the old-fashioned technique of locking everything and controlling access to that state one thread at a time.  The solution often goes something like this:

public class Container {
    private int value;
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
 
    public int getValue() {
        lock.readLock().lock();
        try {
            return value;
        } finally {
            lock.readLock.unlock();
        }
    }
 
    public void setValue(int value) {
        lock.writeLock().lock();
        try {
            this.value = value;
        } finally {
            lock.writeLock.unlock();
        }
    }
}

Obviously, most scenarios which call for such locking are a bit more complex, but you get the picture.  There are two very serious problems with locking:

  • Implementation rules are ad hoc
  • Throughput is reduced

The first issue is obvious: there is nothing actually preventing us from accessing value without locking.  What’s worse, it would be just as easy to lock, access value and then forget to unlock when we were done.  There is no logical connection between value and its lock, the implementation rules are solely enforced by convention.  Of course, we can avoid the problem of forgetting to unlock by using a locking strategy which is lexical in nature, but in that case our second issue (throughput) becomes even more pronounced:

private int value;
private final Object lock = new Object();
 
public int getValue() {
    synchronized (lock) {
        return value;
    }
}
 
public void setValue(int value) {
    synchronized (lock) {
        this.value = value;
    }
}

The problem here is that everything blocks everything else.  If one thread wants to retrieve value, it literally stops the world while the other threads wait their turn.  Data integrity is guaranteed, but this was accomplished only by creating a serial bottleneck on read/write access to value.  On top of that, we still have no enforced correlation between the lock and its corresponding piece of data.

“Glass is Half Full” Concurrency

What we have been doing here is called “pessimistic locking”.  We have literally started with the assumption that we are going to run into data contention issues.  Our entire locking mechanism is designed with the worst case scenario in-mind: lots of threads trying to write to the same piece of data simultaneously.  However, not every concurrent system is going to have constant interference between threads.  As it turns out, in practice data contention is the exception rather than the rule.  Given this, it would be very nice if we could design a system which gave threads the “benefit of the doubt”, assuming that they will not conflict, while still somehow maintaining data integrity.

The answer to this is “optimistic locking”.  We turn the problem completely on its head.  Instead of assuming that there will be problems and so locking everything preemptively, we assume that everything will be fine and just allow everyone free, non-blocking access.  Of course, an “open-door policy” with regards to shared mutable state isn’t enough in and of itself, we have to have some rule for dealing with contention issues, and some way of detecting those conflicts when they happen.  Enter transactions…

The idea behind a purely-optimistic transactional memory model is that all write operations must occur within a transaction.  Data can be read any time, but the memory model will ensure that the data is never in an “intermediate state” - or between writes.  Let’s consider the more complicated scenario of the transfer of funds from one bank account to another.  The steps are as follows:

  1. Withdraw $500 from account A
  2. Deposit $500 into account B

Of course, this isn’t really an algorithm, it’s a high-level overview.  We’re really looking at something a bit more complicated than it would seem:

def transfer(amount: Int, a: Account, b: Account) {
  a.balance = a.balance - amount
  b.balance = b.balance + amount
}

I avoided using the += and -= operators so as to illustrate the problem more explicitly.  This operation has four separate operations on shared mutable state (the account balance).  It’s not too difficult to see how this could go terribly wrong in the case where two separate transfers are taking place simultaneously.  For example, we could transfer $500 from account A into account B, while at the same time we transfer $200 from account B into account C.  Remember, we’re not locking anything, so we run the danger that our concurrent execution order will be interleaved in the following fashion:

  1. [Thread1] Get balance of account A
  2. [Thread1] Set balance of account A to its former amount less 500
  3. [Thread1] Get balance of account B
  4. [Thread2] Get balance of account B
  5. [Thread1] Set balance of account B to its former amount plus 500
  6. [Thread2] Set balance of account B to its former amount less 200
  7. [Thread2] Get balance of account C
  8. [Thread2] Set balance of account C to its former amount plus 200

The operation in red is the one we need to be concerned about.  Thread2 retrieved the balance of account B just prior to when it was modified by Thread1.  This means that when Thread2 calculates the new balance (less 200), it will be basing its result on a now-obsolete balance.  When it sets the balance, the $500 that was transferred from account A will mysteriously vanish, the unfortunate victim of a common race condition.

In a transactional system, both transfers would be separately handled in their own transaction.  Neither of them would modify the other’s data until all operations are completed, at which point the transaction would commit and the data would become “live”.  Thus, when the Thread1 adds $500 to account B, the actual balance of account B will remain constant at its original value (prior to the transaction).  Once the transaction commits, both the balance of A and B will be updated essentially simultaneously.  There will never be a point where $500 is discovered missing “in transit” between two accounts.

This is only half the equation though.  The real genius of the transactional model is just prior to committing, a transaction validates itself, ensuring that all of the data it worked with is still in the same state it was at the beginning of the transaction.  If something has changed, then the transaction is invalidated and must be re-run automatically.  Our issue from above can never happen because the Thread2 transaction will attempt to validate itself, only to discover that the balance of account B has been changed in the meantime.  Rather than yield to the race condition, the Thread2 transaction will throw away its work and start from scratch.  Assuming that nothing else is running, the second validation will be successful (since nothing will have changed concurrently) and the transaction will commit.

It all sounds much more complicated than it actually is.  People familiar with modern databases like Oracle should already be comfortable working with optimistic transactional models.  The technique is a little less common in software, but it can still be applied.

In order to make this work in a conventional application setting, we need to introduce a few more abstractions.  There are several ways to go about this, but I have chosen to follow the model laid down by Rich Hickey in his implementation of Clojure’s STM.  In turn, Clojure seems to take a fair bit of inspiration from Haskell’s STM monad, although it does not port over concepts like transaction composition and what Simon Payton Jones calls “choice”.  Basically, the design can be distilled as follows:

  • Each item of shared state must be stored in a reference
  • References can be read at any point, but they can only be modified within a transaction
  • The data contained within a reference must itself be immutable, the reference simply allows you to switch the data it contains
  • Within a transaction, reading a reference does not return its current value, but rather the value it had precisely when the transaction began.  This allows data to change outside the transaction without disrupting its internal processing.
  • Changes made to a reference inside a transaction are not world-visible until the transaction completes and is committed.  Validation ensures that no data is lost during the commit.
  • Transactions must never contain side-effects as they may be executed multiple times

With all that in mind, let’s get to work on an implementation!

References

Since these are the primitive building blocks of our STM, it seems logical that we should start here.  At an extremely basic level, we will need a structure which looks something like the following:

class Ref[T](private var value: T) {
  def get = value
 
  def :=(value: T) {
    this.value = value
  }
}

This is the basic idea anyway.  We want to be able to retrieve a value from a reference using get, and we want to be able to store a value in the reference using the := operator (hat tip to all of you rabid Pascal fan-boys).  Unfortunately, we haven’t really accomplished anything here.  Yes, we have a reference to wrap around a mutable piece of data, but there are no associated concurrency semantics.  Remember the definition of a reference?  We must only be able to write to it within a transaction.  Furthermore, the value returned from a reference within a transaction is not necessarily its current, world-visible state, but rather the data it had the moment the transaction began.

In order to accommodate these requirements, we will introduce the concept of a context.  Both read and write access to a reference will require a context to be present.  We will have one context for the entirety of world-visible state.  Additionally, each transaction will have its own context.  In that way, we can ensure that transaction modifications are kept local to itself prior to commit while at the same time preventing changes from other transactions from becoming visible after the transaction has started (potentially leading to data integrity problems).  Our API has now evolved to something more like the following:

class Ref[T](private var value: T) {
  // ???
 
  def get(c: Context) = c.retrieve(this)
 
  def :=(value: T)(c: Context) {
    c.store(this)(value)
  }
}

I left the constructor undefined because of the fairly obvious problem in this implementation: how does the Context get the values in the first place?  For the moment, let’s put that problem aside and deal with a more interesting one: limiting write access to within transactions.  Recall that we’re going to have two different kinds of contexts: a live context which is global, as well as a context which is local to the transaction.  The requirement alone implies a way to statically restrict reference mutability to within transactions: require a type of context other than the live one.  To this end, we will derive the following closed hierarchy.  It is closed because there will be no other Context implementations, preventing over-zealous API extensions from fouling up our semantics.

image

We will define Context to be an abstract class and LiveContext to be a singleton object.  Each transaction will have its own Transaction context which it will use in both read and write operations.  LiveContext will only be used outside of a transaction, when there is no other context available.  To enforce this, we will restrict the type of the Context taken by the reference assignment operator to only accept Transaction:

def :=(value: T)(c: Transaction) {
  c.store(this)(value)
}

With this in mind, we can start to envision what the syntax for reference operations will look like:

def transfer(amount: Int, a: Ref[Int], b: Ref[Int])(c: Transaction) {
  a.:=(a.get(c) - amount)(c)
  b.:=(b.get(c) + amount)(c)
}
 
// ...
val accountA = new Ref[Int](1500)
val accountB = new Ref[Int](200)
 
// somehow call `transfer` as a transaction
println("Account A: " + accountA.get(LiveContext))
println("Account B: " + accountB.get(LiveContext))

And just when our design was looking so nice too.  We’ve succeeded in preventing at compile time the modification of references outside of transactions, but we did it at the cost of a tyrannical syntax.  Fortunately, Scala has a handy mechanism for cleaning up syntax such as this, one which should reduce the volume of annoying bulk by several orders of magnitude: implicits.

IScala makes it possible to mark parameters as accepting implicit values.  These parameters in turn become implicit values themselves.  When a method which accepts an implicit parameter of a specific type is called with an implicit value of the same type in scope, the parameter can be omitted entirely.  Thus, by marking LiveContext as an implicit object and appropriately annotating the last parameter of transfer as well as the Context parameters of Ref’s accessor and mutator, we can eliminate almost all of the annoying bulk in the above example:

implicit object LiveContext extends Context {
  ...
}
 
class Ref[T](value: T) {
  // ???
 
  def get(implicit c: Context) = c.retrieve(this)
 
  def :=(value: T)(implicit c: Transfer) {
    c.store(this)(value)
  }
}

With these extra modifiers in place, we can redo our transfer snippet to see how things look.  To cut down on line length, we will also assume that accountA and accountB are references in some sort of global scope:

val accountA = new Ref[Int](1500)
val accountB = new Ref[Int](200)
 
def transfer(amount: Int)(implicit t: Transaction) {
  accountA := accountA.get - amount
  accountB := accountB.get + amount
}
 
// somehow call `transfer` as a transaction
println("Account A: " + accountA.get)
println("Account B: " + accountB.get)

Pretty slick, and we aren’t even finished yet!  We can add an implicit conversion from Ref[T] to T, eliminating the need to call get anywhere in the code:

implicit def refToValue[T](ref: Ref[T])(implicit c: Context) = {
  ref.get(c)
}

…and the final example syntax:

val accountA = new Ref[Int](1500)
val accountB = new Ref[Int](200)
 
def transfer(amount: Int)(implicit t: Transaction) {
  accountA := accountA - amount
  accountB := accountB + amount
}
 
// somehow call `transfer` as a transaction
println("Account A: " + accountA)
println("Account B: " + accountB)

With the exception of the := syntax (which is unfortunately unavoidable), you would never be able to tell that references are being used rather than conventional vars.  Even better, we have managed to preserve our static assurance that no reference may be modified outside of a transaction.  If we were to attempt to call the := method without an instance of Transaction on-hand, the Scala type checker would complain and our code would not compile.  The we need to do to close the loop is to make sure that the only place a Transaction instance can be obtained is inside a transaction (seems logical).

Atomic

For the moment, let’s shelve all of the implementation requirements for the STM and instead focus on the API.  We already have an almost elegant syntax for references, but we’re still missing one final piece: initiating the transaction.  Being incredibly imaginative and innately creative, I decided the best way to devise an API for this would be to hit Wikipedia.  After all, why come up with something original when someone smarter has already solved the problem?

 // Insert a node into a doubly-linked list atomically
 atomic {
     newNode->prev = node;
     newNode->next = node->next;
     node->next->prev = newNode;
     node->next = newNode;
 }

 atomic (queueSize > 0) {
     remove item from queue and use it
 }

Credit: Software transactional memory # Proposed language support

It’s a fairly straightforward API, one which could be expressed in Scala as it stands if we really wanted to.  Obviously the body of the first transaction is a little closer to C than Scala, but we can just take that as example pseudo-code and move on.  The real core of the API is a single method, atomic, which takes a function value and executes that function as a transaction.  If a conditional is provided, it acts as a guard.  If the guard ever returns false, the transaction aborts and is not retried.

The only problem here is we haven’t accounted for our implicit Transaction parameter.  Intuitively, we could just add a parameter of type Transaction to the function value, but unfortunately Scala doesn’t allow anonymous functions with implicit parameters.  That leaves one of two options: either take the Transaction as a parameter to the anonymous function and then store it in an implicit value within the function; or pass an actual method which takes an implicit parameter.  In truth, either approach will work, but for the remainder of the article I will use the method-passing approach, rather than the separate assignment within the anonymous method.

Altogether, our API in action looks something like this:

val accountA = new Ref[Int](1500)
val accountB = new Ref[Int](200)
 
def transfer(amount: Int)(implicit t: Transaction) {
  accountA := accountA - amount
  accountB := accountB + amount
}
 
atomic(transfer(500)(_))    // run transaction
 
println("Account A: " + accountA)
println("Account B: " + accountB)

Notice the extra (_) syntax in the call to atomic.  (thanks, Jesper!)  This is required because transfer accepts an implicit parameter.  Without it, Scala doesn’t know whether we mean to call the method using some implicit value in scope or if we want the function value itself.

Remember this API.  We’ll come back to the implementation of this method after we have completed the rest of the framework.  For now, let’s move onto context…

Context

Returning to the dark and mysterious internals of the implementation, we now come to the deep morass of Context.  As it turns out, everything in the STM will revolve around this class and its two separate implementations.  Recall that it is responsible for retrieving reference values, controlling what data is visible from within a transaction and what data is live.  Generally speaking, we are going to have the following design with respect to where reference data is handled:

  • If in LiveContext, access the One True copy of the data.
  • If in Transaction, access the One True copy if and only if there is no transaction-local version.  Once the reference has been read, cache the data and return that value from all future reads within the transaction.  We’re cheating a bit here since we aren’t taking a snapshot of the world on transaction start, we’re waiting for the first reference read.  As long as we get our validation right, this should be ok.

Logically, this is a simple default operation clouded by a fairly substantive special case (within a transaction after the reference has been read or written once).  To handle the special case, it only makes sense that we use a map from reference to the data to which it corresponds in that transaction.  Remember that the data within a transaction may be a bit behind the live copy.

We could use another map for the general case, but we can make things even simpler than that.  Rather than having a global map from reference to value, we can just store the live values within the Ref objects themselves.  They will still nee