
Bloom Filters in Scala (and all the fun that they bring)

13 Oct 2008

This is a post I have been meaning to write for quite some time.  Despite being an officially excommunicated former-member of the Ruby community, I still like to keep up on the latest developments.  A few months ago, my Ruby Inside feed produced an interesting post introducing a new library by Peter Cooper (the author of Ruby Inside).  Unlike most “new” Ruby libraries, which are all too often just bad clones of Rails snibblets, Peter’s WhatLanguage seemed genuinely unique and interesting.  For the first time in years, I was reading about an innovation in Ruby-land which actually sounded exciting.  A good feeling, to be sure.

In a nutshell, WhatLanguage (project page at GitHub) is a statistical language identification API which analyses arbitrary input String(s), scoring them against a set of built-in languages (including English, French, Spanish, Russian, and others).  Add to that a little bit of meta-programming magic and the possibilities are eye-catching to say the least:

require 'whatlanguage'
 
'Here is some text to identify'.language       # => :english
'¿Hola, pero donde está su pelo?'.language     # => :spanish

To be honest, I’m not sure what practical application I would have for such a library, but it sure looks like fun!  :-)   Under the surface, the implementation is deceptively simple.  WhatLanguage bundles a set of words for each language it supports.  The implementation then searches these sets for each word in the string, aggregating a score for each language depending on whether or not the word was found.  If exactly one language has a higher score than the others once the end of the string is reached, it is “declared the winner” and returned as a result.  Otherwise, the analysis is undecided and nil is returned.  Simple, right?

The tricky part is somehow storing and searching those gigantic language word sets in an efficient manner.  As it turns out, this isn’t so easy.  Just stop and consider the number of words which exist in your native tongue.  Unless you speak a very strange, primarily-tonal language, the number is likely in the upper hundreds of thousands.  Just going off of Peter’s word lists, Russian is the real trouble maker, weighing in with a nearly one million word vocabulary.  Any attempt to store that much data would require megabytes of memory and several minutes to process, even on a fast machine.  Any sort of hash set is completely out of the question due to the nature of contiguous memory allocation, but can you imagine trying to perform a search traversal on a tree with almost a million nodes?  Obviously, a more effective solution is required.

Probabilistic Sets

So far we have been focusing our imaginary efforts on data structures which guarantee 100% accuracy and full reversibility.  That is to say, data structures which allow you to get out what you put in and which ensure that if something is “in” it will be found and if something is “out” it will not.  After all, is a data structure without these guarantees really all that useful?

As it turns out, the answer is “yes”.  Consider a standard hash set as an example.  Sets make absolutely no guarantee about the ordering of their elements.  If you attempt to iterate over a set, you will eventually reach everything inside, but you cannot rely on a certain element coming before the others.  Of course, the advantage we gain by making this trade-off is constant time searching by item identity.  Lists cannot achieve this performance, and neither can arrays.  If you want to search either one of these structures, you must look through every element from start to finish.  So with hash sets, we have effectively traded a fundamental guarantee about our data for efficiency in some other respect.

The same is true for probabilistic sets.  We don’t actually need all of the properties of a set to implement WhatLanguage.  For example, we don’t need to actually retrieve a word from our dictionary, we only need to discover whether or not that word is in the set.  In fact, we really only need our data structure to have two important properties:

  • Lightning fast boolean searches (does the set contain this element?)
  • Minimal memory footprint

Hash sets can make the first guarantee, but as we have already noticed, they don’t really require any less memory than just storing everything in an array.  Since we can’t afford one million words-worth of memory per language, we need to come up with some way of storing information about the presence of an element without retaining the data it represents.  In order to do this, we have to make one important sacrifice: accuracy.

We basically need to accept the fact that our data structure may sometimes return inaccurate results.  As it turns out though, this is just fine.  Remember, we’re scoring languages one token at a time; an incorrect answer now and again isn’t going to kill us.  Given a large enough input string, a minimal percentage of inaccuracy will literally be lost in the noise.

As it turns out, there already exists a very old technique for accomplishing exactly these requirements.  Bloom filters have been around since the beginning of time (or at least since 1970, when Burton Bloom first described them), but unfortunately very few developers are aware of them.  In a nutshell, they precisely satisfy all of the needs we just finished enumerating.  They provide true (not amortized) constant-time searches and will consistently return true or false for a given element, depending on whether or not the value is in the set.  Occasionally they will give false positives (claiming that an element is in the set when in fact it is not), but never false negatives.  This property is extremely useful in many applications, but it turns out that this is even more than we need for WhatLanguage.

Most importantly though, bloom filters are extremely compact in memory.  Rather than storing an entire value, they hash every element to one or (usually) more indexes in an array of bits (0 or 1).  In an optimal bloom filter guaranteeing no false positives, each element would correspond to precisely k bits of memory (where k is the number of hash functions, usually in the vicinity of 3 or 4).  Once this hashing is complete, the value itself may be discarded.  Thus, assuming an optimal bloom filter with 3 hash functions, the storage requirements for one million words would only be 375 KB of memory.  Compare this to the 8 MB required just to store the pointers to each of the words, much less the values themselves!  If we include the underlying character values, we start pushing closer to 20 MB.
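
To make those numbers concrete, the arithmetic works out as follows (assuming k = 3 hash functions, one million words, 64-bit pointers, and an average of roughly ten characters per word — that last figure is my own ballpark):

10^6 words × 3 bits = 3 × 10^6 bits ≈ 375 KB

10^6 pointers × 8 bytes = 8 MB

10^6 words × 10 chars × 2 bytes per char ≈ 20 MB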

375 KB for a million elements is impressive, but bloom filters can actually do a lot better than that.  Remember, we agreed to accept some false positives.  It’s easy to imagine how even an ideal bloom filter (such as the one we have been analyzing) could have some inaccuracy, but if we’re willing to live with a few more false positives, we can allow hash collisions and drop our memory requirements into the 200 KB range.  Not bad, considering how much data we’re storing!

Bloom Filters

So how does the magic actually work?  We’ve hinted at the implementation in our analysis of memory requirements, but we’re still missing some important details.  (The Wikipedia article on bloom filters includes a helpful diagram of the process described below.)

Under the surface, a bloom filter is just an array of boolean values.  The length of the array is referred to as the width of the bloom filter.  When the filter is empty, all of the array values are set to false.  To insert an element, we compute k different hash indexes in the array and set them all to true.  The real beauty of the data structure is the manner in which we handle collisions: we don’t worry about it.  Whenever one (or more) of the hash functions returns an index which is already true, we simply leave that index as-is and move on.  Thus, the procedure for inserting an element into a “full” bloom filter is precisely the same as inserting into an empty one: set all of the hash indexes to true.  In a sense, bloom filters can never be full, they just get less and less accurate.

Bloom filters define only one operation other than insertion: searching.  To determine whether or not a value is in the set, we run all of our hash functions and find the indexes to which it corresponds.  If all of these indexes are true, then we can assume that the value is indeed in the set.  Conversely, if any one of the indexes is false, then the value cannot be in the set, since we would have set that index to true upon insertion.

It is important to note that removal is not supported by bloom filters.  Intuitively, we could just take the value to be removed, calculate its hash indexes and then set the corresponding elements to false.  However, if any other value also hashes to even one of those locations, we will effectively be removing it from the bloom filter “by accident”.  Subsequent searches for that value will hash to the index which we erroneously set to false, resulting in a false negative on the search (saying that an element is not in the set when in fact it is).  Fortunately for us, WhatLanguage only requires insertion and searching for values; removal is not on the agenda.

Scala Implementation

Well, this has probably been a nostalgic (for those of you who enjoyed your university days) stroll through the land of data structure theory, but we’re still lacking one critical factor: a concrete implementation.  Bloom filters don’t sound that complicated on paper, but to do anything useful with them we’re going to need some code.

Before we dive in, we might want to define some more precise requirements for our implementation.  We’re writing this in Scala, so it would be nice to implement bloom filters as a functional data structure.  I’m not going to get into all the benefits of immutable data structures; suffice it to say, I really think that they’re the only way to go for most serious applications.  Unfortunately, in order to do this, we’re going to need some sort of immutable, array-like data structure under the surface which provides random access in constant time.  Sound familiar?

This is actually where the whole saga of persistent vectors in Scala found its roots.  The first time I tried to implement a functional bloom filter in Scala I made use of a conventional array under the surface with copy-on-write semantics.  As you can imagine, this is hideously inefficient in both space and time.  In order to solve the problem, I had to implement a data structure which was just as efficient for reading and writing, but which also shared structure with subsequent versions.  Now, two months later, I’ve borrowed just such a design from Clojure and created a first-class implementation in Scala.

With a persistent vector now at our disposal, we can easily implement a persistent bloom filter by using a vector to represent our boolean indexes.  Every time we perform an insertion, we will “modify” the relevant indexes, creating a new instance of the vector.  This vector can then be used in our “modified” bloom filter, returned as a new instance of our container class.  Since persistent vectors share unmodified structure between versions, so does our bloom filter.  All of the hard work of efficient immutability is already done for us, we just need to implement a thin facade.  We start out by defining the class skeleton:

import BloomSet._
 
class BloomSet[A] private (val size: Int, val k: Int, 
                           private val contents: Vector[Boolean]) extends ((A)=>Boolean) {
  val width = contents.length
 
  def this(width: Int, k: Int) = this(0, k, alloc(width))
}
 
object BloomSet {
  def alloc(size: Int) = {
    (0 until size).foldLeft(Vector[Boolean]()) { (c, i) => c + false }
  }
}

We’ll keep all of our utility functions in the companion object.  There’s really no need to put them in the main class, and the JVM is able to optimize dispatch to static members to a slightly higher degree than it can instance members (invokestatic).  Technically, we don’t really need the alloc function, it just hides some of the boiler-plate involved in initializing a vector to a certain length.  Note that this operation is extremely fast, even for high values of width.  Despite the fact that we are creating a huge number of Vector objects, very little memory is actually wasted.  Even at this early stage in the implementation, the advantages of persistence are evident.

A Little Math

You’ll notice that our bloom filter has three major properties: size, k, and width.  As previously mentioned, k is the number of hash functions to be used, while width is the length of the array used to store the boolean values.  These values have to be user-configurable because different applications will call for different optimizations.  For example, implementing WhatLanguage requires the storage of almost a million elements per set.  Obviously we would not want to attempt to use a bloom filter which only had a width of (say) 200.  This wouldn’t cause any errors per se (remember, bloom filters are never “full”), but the accuracy of such a filter would be almost nil.  With a million elements all hashed into a mere 200 indexes, it is likely that every single index would have a value of true.  Such a bloom filter would return true when searching for any value, including those which were not inserted.  We said we would tolerate some false positives, but we still want the results to be meaningful.  Since the width of a bloom filter cannot be changed once data is inserted (at least, not without “losing” data), the user must be able to decide in advance how much space will be required and then instantiate BloomSet accordingly.
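
A quick back-of-the-envelope check bears this out.  The probability that any given bit is still false after inserting n elements with k hash functions into a filter of width m is roughly e^{-kn/m} (the same quantity which shows up in the accuracy formula a little further down).  Plugging in m = 200, n = 1,000,000 and k = 3:

e^{-kn/m} = e^{-(3 \cdot 10^6) / 200} = e^{-15000} \approx 0

In other words, essentially every bit ends up true, and every search comes back positive.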

The value of k is an even more interesting proposition.  The intuition is that the larger our value of k (thus, the more hash functions we use), the greater the accuracy of our bloom filter and the fewer false positives we will return.  This is technically true when the bloom filter does not contain very many elements, but as more and more values are inserted, higher values of k become problematic.  Wikipedia has an excellent (and extremely readable) mathematical analysis of this and other properties of bloom filters.  As it turns out, the optimal value for k (one which will yield the fewest false positives) can be computed by the following expression (where m is the width and n is the size, or the number of elements contained in the filter):

k = \ln(2) \; \frac{m}{n}

For the sake of quick-computation, this is essentially equivalent to:

k \approx \frac{9m}{13n}

The equivalence isn’t precise (logarithms are almost always irrational; 9/13 ≈ 0.6923 while ln(2) ≈ 0.6931), but since we’re already rounding the result to the nearest integer, we don’t need to be too concerned.

While it would be nice if we could just auto-magically set k to the optimum value and be done with it, that just isn’t the case.  Like width, k cannot be changed after data has been inserted.  Unfortunately, its optimum value is dependent on the number of items which will eventually be in the set.  Since we don’t really “know” this value within the data structure until after the insertions are complete, there is no way for us to self-optimize.  To provide maximum configurability, we need to allow this value to be set by the API consumer on a case-by-case basis.

As an interesting aside, we can actually compute the accuracy of a bloom filter on a scale from 0 to 1 given values for width, k and size.  In this case, accuracy is defined as the odds that a search result is indeed accurate (the chance of it not being a false positive).  Wikipedia gives the following expression for this accuracy:

1 - \left(1 - e^{-kn/m}\right)^k

Thanks to the marvels of approximation, this is a fairly efficient computation on modern platforms.  Translated into Scala, it looks something like this:

lazy val accuracy = {
  val exp = ((k:Double) * size) / width
  val probability = Math.pow(1 - Math.exp(-exp), k)
 
  1d - probability
}
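
As a quick sanity check, plugging in the figures we will see later for the English wordlist (m = 2,000,000, n = 234,936, k = 5) gives:

1 - \left(1 - e^{-5 \cdot 234936 / 2000000}\right)^5 \approx 0.9827

…which agrees with the accuracy reported by the generator script further down.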

Core Functionality

As interesting as this minor detour has been, we still haven’t arrived at a working implementation of the bloom filter data structure.  We have the basic structure in place, but we still need to implement the two core operations: insert and search, rendered in Scala as + and contains:

def +(e: A) = new BloomSet[A](size + 1, k, add(contents)(e))
 
def contains(e: A) = {
  (0 until k).foldLeft(true) { (acc, i) => 
    acc && contents(hash(e, i, width)) 
  }
}
 
protected def add(contents: Vector[Boolean])(e: Any) = {
  var back = contents
 
  for (i <- 0 until k) {
    // Vector's update returns a *new* vector, so this re-binds back to a
    // modified copy rather than mutating anything in place
    back = back(hash(e, i, width)) = true
  }
 
  back
}

Feels sort of anti-climactic doesn’t it?  After all that analysis, we find that it is possible to implement the essence of a bloom filter in less than 17 lines of code.  Technically, I probably could trim this down to more like 8 lines, but this isn’t an exercise in Scala golf.  These three methods are just the materialization of all of the hand-wavy descriptions that we’ve spent the last five pages working through.  Each time we insert an element, we calculate k different hash values and set the appropriate vector indexes to true.  To find that element, we calculate the same k hashes and ensure that all of the indexes have an appropriate boolean value.
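
Just to see these methods in action, here is a quick hypothetical usage snippet.  The values and dimensions are invented purely for illustration, and I’m assuming that the apply method implied by the (A)=>Boolean parent simply delegates to contains:

// create an empty filter of width 1000 with 3 hash functions, then "add" to it
val names = new BloomSet[String](1000, 3) + "Daniel" + "Peter" + "Charlie"
 
names contains "Daniel"      // => true
names contains "Tony"        // => false (barring an unlucky false positive)
names("Peter")               // => true, using the filter as a plain predicate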

Hashing

Those of you who are actually reading the code snippets rather than just blindly trusting my veracity will probably notice that I’ve omitted a fairly important function from the above sample; specifically: hash.  Somehow, we need to define a function which will simulate the existence of a potentially unlimited number of hash functions.  Based on our usage of this function, we need to be able to grab a hash value for a given value and function index, restricted to a given width.  This is not a trivial problem.

As a good developer, you should already be trying to think up ways to reduce the code involved in implementing these requirements.  I mean, we could potentially just dream up a few dozen different hash functions with a fairly wide distribution of results.  However, I don’t know about you, but I really don’t have that much imagination.  I have a hard enough time developing just one solid hash function, much less dozens.  What’s more, if each hash function corresponds with a different method, then we can’t really scale the number of hash functions to any integer k.

A better approach would be to create a single hash function and then skew its value n times for every n between 0 and k.  Like most things that have to do with hashing, it’s cheating, but providing we do our job right, it should give us a fairly decent way of computing the required k hash functions.  Of course, the key phrase there is “do our job right”.  We need to make sure that we skew the initial hash value sufficiently for each successive hash, otherwise successive hash functions might get “stuck in a groove”.  For example, let’s assume that each successive hash function is equivalent to the previous hash + 2.  If we then inserted two elements into the bloom filter with k = 3, there would be the potential that the hash(0) for the first element would be precisely 2 greater than the hash(0) for the second.  If this is the case, then hash(1) for the second would equal hash(0) for the first.  Thus, the first and second elements would collide on two out of three indexes.  Not exactly an optimal distribution.

To avoid this problem, we should be sure to skew the hash value by a different amount for each hash function.  A convenient way of doing this is to use the number of iterations, or the index of the particular hash function.  To further improve the distribution, we will also use the bitwise XOR operation rather than simple addition.  It’s not really intuitively obvious, but XOR is mathematically far superior for hashing and just as cheap in terms of processor cycles.  Put it all together, and the implementation looks something like this:

def hash(e: Any, iters: Int, bounds: Int): Int = {
  Math.abs(
    if (iters == 0) e.hashCode 
    else iters ^ hash(e, iters - 1, bounds)
    ) % bounds
}

The final bit of math in this function mods our skewed hash value with the given bounds.  Whenever we call hash, we always pass the width of the bloom filter.  This is a trick that you’ll see a lot in any hash-based data structure.  Essentially, we’re just mapping our hash value into the required domain.  Hash values can be anything between -2^31 and 2^31 - 1, but our vector indexes can only be between 0 and width - 1.  Just a rule of thumb, but any time you have a problem where you must “map values from a large domain to a more restricted one”, mod is generally the correct approach.

Observant readers will notice that our implementation of hash is a little inefficient.  Actually, it’s worse than just a “little” inefficient: it’s downright terrible.  Remember that we have to call this method once for every integer n between 0 and k.  This isn’t so bad, since k is usually comparatively small, but the real problem is our sneaky use of recursion within hash.  The time complexity of hash is precisely O(n), where n is the number of iterations.  Once you factor this in with the number of times we call this function, it is easy to see that the total complexity of computing k hash values is actually O(k^2).  Thus, for k = 4, we actually invoke hash 10 times (1 + 2 + 3 + 4) rather than just 4.  You can imagine how this could get very problematic for higher values of k.

Technically, we could rewrite add and contains to compute the hash values incrementally from the bottom up.  This would bring the hashing complexity down to O(k).  However, the recursive version is a bit more elegant in terms of code, not to mention easier to talk about in a modular fashion.  In practice, this inefficiency is only a problem for higher values of k due to the fact that computing a single-step hash value is actually a very fast operation.  We will run into trouble in a little bit though when we attempt to optimize the bloom sets for shorter wordlists in WhatLanguage.
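
For what it’s worth, here is a rough sketch of what that incremental rewrite might look like.  The hashes helper is my own invention (it isn’t part of the BloomSet code shown above); it just threads the previous, already-bounded hash value through each iteration so that all k indexes are produced in a single O(k) pass:

// sketch only: compute all k hash indexes iteratively rather than recursively,
// applying the same skew (i ^ previous) and mod at each step as hash does
protected def hashes(e: Any): List[Int] = {
  var h = Math.abs(e.hashCode) % width
  var back = List(h)
 
  for (i <- 1 until k) {
    h = Math.abs(i ^ h) % width
    back = h :: back
  }
 
  back
}
 
// add and contains would then just walk this list once, e.g.
//   def contains(e: A) = hashes(e) forall (contents(_))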

Serialization

One of the important requirements of WhatLanguage that we haven’t really touched on so far is the ability to store freeze-dried bloom filters on disk and then suck them back into memory quickly and efficiently.  If you think about it, this requirement only makes sense since one of our primary motivations from the start was to avoid the burdens associated with huge word lists.  After all, if our in-memory representation of a language is only a few hundred kilobytes, shouldn’t the on-disk representation be no worse?  Furthermore, if we had to parse a wordlist into a bloom filter every time we wanted to check a string’s language, the performance imposition would be utterly prohibitive.

Fortunately, there is a fairly obvious format which we could use to efficiently serialize an arbitrary bloom filter.  Recall that the real heart of a bloom filter is the array of boolean flags, an array which can also be represented as an array of bits.  We could very easily select these bits eight at a time, combine them together using bit-shifting and produce a byte value which could be sent to an OutputStream.  Thus, a bloom filter with an array of {true, false, true, true, true, false, false, true, false} would be stored as the following series of bits: 101110010, which in hexadecimal byte form becomes B9 00.

In addition to the array, we also need to store the size, k, and width of the bloom filter, otherwise it will be impossible to find anything reliably in the deserialized result.  The Scala store and load methods to accomplish this are verbose, but fairly straightforward:

class BloomSet[A] ... {
  ...
 
  def store(os: OutputStream) {
    os.write(convertToBytes(size))
    os.write(convertToBytes(k))
    os.write(convertToBytes(contents.length))
 
    var num = 0
    var card = 0
    for (b <- contents) {
      num = (num << 1) | (if (b) 1 else 0)    // construct mask
      card += 1
 
      if (card == 8) {
        os.write(num)
 
        num = 0
        card = 0
      }
    }
 
    if (card != 0) {
      os.write(num)
    }
  }
}
 
object BloomSet {
  def load[A](is: InputStream) = {
    val buf = new Array[Byte](4)
 
    is.read(buf)
    val size = convertToInt(buf)
 
    is.read(buf)
    val k = convertToInt(buf)
 
    is.read(buf)
    val width = convertToInt(buf)
 
    var contents = Vector[Boolean]()
    for (_ <- 0 until (width / 8)) {
      var num = is.read()
      var buf: List[Boolean] = Nil
 
      for (_ <- 0 until 8) {
        buf = ((num & 1) == 1) :: buf
        num >>= 1
      }
 
      contents = contents ++ buf
    }
 
    if (width % 8 != 0) {
      var buf: List[Boolean] = Nil
      var num = is.read()
 
      for (_ <- 0 until (width % 8)) {
        buf = ((num & 1) == 1) :: buf
        num >>= 1
      }
 
      contents = contents ++ buf
    }
 
    new BloomSet[A](size, k, contents)
  }
 
  ...
}
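
One loose end: store and load lean on a pair of helpers, convertToBytes and convertToInt, which I haven’t shown.  Their exact implementation isn’t terribly interesting, but for completeness, something along these lines (big-endian packing of an Int into four bytes) would satisfy the usage above.  Treat this as a plausible sketch rather than the canonical version:

object BloomSet {
  // pack an Int into four big-endian bytes
  def convertToBytes(i: Int) = {
    val back = new Array[Byte](4)
 
    back(0) = ((i >>> 24) & 0xFF).toByte
    back(1) = ((i >>> 16) & 0xFF).toByte
    back(2) = ((i >>> 8) & 0xFF).toByte
    back(3) = (i & 0xFF).toByte
 
    back
  }
 
  // reassemble an Int from four big-endian bytes
  def convertToInt(bytes: Array[Byte]) = {
    bytes.foldLeft(0) { (acc, b) => (acc << 8) | (b & 0xFF) }
  }
 
  ...
}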

It is interesting to note that store is a good example of what David MacIver meant when he said that functional code ≠ good code.  I actually tried to write this in a functional style to start with, but I gave up after it became horribly ugly (having four nested folds is never a good sign).  The imperative rendering (in this case) is concise and elegant, one of the many advantages of a hybrid language like Scala.

This particular format for rendering a bloom set is almost precisely identical to the format used by Peter Cooper’s Bloomin’ Simple, the library which sits underneath WhatLanguage.  However, due to hashing differences it is impossible to use his serialized bloom filters in Scala or vice versa.

WhatLanguage in Scala

So now that we have our magic bloom filter ready and waiting, it’s time to take a crack at Peter’s fancy language identifier!  As it turns out, this is a lot simpler than the bloom filter itself.  Conceptually, all we have to do is load a different bloom filter for each language and then use that bloom filter to check each input string token individually.  For the sake of uniformity, we will convert all tokens to lower-case.

Of course, before we can load the language bloom filters we need to first store them, and we cannot store a bloom filter before it is generated.  To that end, we need to create a simple Scala script (yes, a script) which reads in a file of newline-delimited words, inserts them all into an instance of BloomSet and then stores the result in a corresponding language file.  Scala’s scripting functionality is actually quite good, especially for a statically typed language, and really deserves some better recognition.  The complete language generator script is as follows:

import com.codecommit.collection.BloomSet
import java.io.{BufferedOutputStream, File, FileOutputStream}
import scala.io.Source
 
val WIDTH = 2000000
 
def computeK(lines: Int) = (((9:Double) * WIDTH) / ((13:Double) * lines)).intValue
 
for (file <- new File("wordlists").listFiles) {
  if (!file.isDirectory) {
    println(file.getName)
    println("==========================")
 
    val src = Source.fromFile(file)
    val count = src.getLines.foldLeft(0) { (i, line) => i + 1 }
    println("  Word count: " + count)
 
    val optimalK = computeK(count)
    val init = new BloomSet[String](WIDTH, Math.min(optimalK, 100))
 
    println("  Optimal K: " + optimalK)
    println("  Actual K: " + init.k)
 
    val set = src.reset.getLines.foldLeft(init) { _ + _.trim }
 
    println("  Accuracy: " + set.accuracy)
 
    val os = new BufferedOutputStream(
        new FileOutputStream("src/main/resources/lang/" + file.getName))
    set.store(os)
    os.close()
 
    println()
  }
}

I originally created this script using JRuby, thinking of course that it would be much easier to perform a simple, one-off task like this in a dynamically typed language.  Interestingly enough, the JRuby version was about twice as long and actually took dramatically longer to run.  By “dramatically” I mean on the order of seventy times longer.  The above Scala script takes just over 30 seconds to run on my machine using the wordlists from WhatLanguage.  This stands in stark contrast to the 35 minutes required for the JRuby script.  Both scripts use the same underlying data structure to perform all the work (BloomSet), so it’s hard to blame the difference on anything but the JRuby layer itself.  In short: JRuby does not seem terribly well-suited for computationally-intensive tasks, even when most of that computation is taking place in Java-land.  Anyway…

This script is fairly straightforward.  The one interesting landmark is the fact that it actually makes two separate passes over the wordlist file.  The first pass just counts the number of lines, while the second pass reads the words individually, trims them and stores them in the bloom filter.  This two-pass approach allows us to calculate the optimal value for k in our bloom filter.  This contrasts with the Ruby implementation of WhatLanguage, which just uses 3 as a global default (the width is the same in both libraries).  According to my math, this per-language optimization actually ekes out an average 2-3% better accuracy in the bloom filters.  This means fewer false positives and nominally better language detection.

I say “nominally” because such a minuscule improvement in bloom filter performance actually makes very little difference in the grand scheme of things.  Most strings are decidedly one language or another, meaning that results should be essentially identical between the two implementations.  Regardless, it’s the thought that counts.  (right?)

Just for the curious, the script produces the following output when run:

  dutch
  ==========================
    Word count: 222908
    Optimal K: 6
    Actual K: 6
    Accuracy: 0.986554232499401

  english
  ==========================
    Word count: 234936
    Optimal K: 5
    Actual K: 5
    Accuracy: 0.9827068384240777

  farsi
  ==========================
    Word count: 339747
    Optimal K: 4
    Actual K: 4
    Accuracy: 0.9408664843803285

  french
  ==========================
    Word count: 629569
    Optimal K: 2
    Actual K: 2
    Accuracy: 0.7817441534106517

  german
  ==========================
    Word count: 298729
    Optimal K: 4
    Actual K: 4
    Accuracy: 0.9590696917792074

  pinyin
  ==========================
    Word count: 399
    Optimal K: 3470
    Actual K: 100
    Accuracy: 1.0

  portuguese
  ==========================
    Word count: 386393
    Optimal K: 3
    Actual K: 3
    Accuracy: 0.9148904670664727

  russian
  ==========================
    Word count: 951830
    Optimal K: 1
    Actual K: 1
    Accuracy: 0.6213162918878307

  spanish
  ==========================
    Word count: 595946
    Optimal K: 2
    Actual K: 2
    Accuracy: 0.7984358472107859

  swedish
  ==========================
    Word count: 54818
    Optimal K: 25
    Actual K: 25
    Accuracy: 0.9999999755910407

I’m not sure why Pinyin only has a vocabulary of 399 words, but that seems to be the way things are.  This means of course that we’re storing barely 400 elements in a bloom filter with a width of 2,000,000.  Needless to say, I wasn’t surprised to see that the optimal k was in the mid thousands.  Unfortunately, this is where our inefficient hash implementation comes back to bite us.  I discovered that if I were to allow a k value of 3470, even when only inserting 399 elements, the time required to process just the single language was upwards of 20 minutes.  This may have something to do with the fact that 3470^2 is very, very large.

To get around this problem, I cheat and cap the k value at 100 in the script.  This still produces a computed accuracy of 100% in the bloom filter, and it takes much less time to process.  A fast hash is actually even more important during lookups.  Considering that we have to check each token in a string against each and every supported language, a fast lookup is crucial to ensuring performance of the final product.  If it took even 10 ms to compute the hash of a single token when checking against the Pinyin language, the implementation would be completely unusable.

A Little API Design

Now that we have our word lists encoded as bloom filters, we can turn our attention to slightly more important problems; specifically: what do we want the API to look like?  I decided to base my Scala implementation primarily on Peter’s Ruby API.  However, there is one small wrinkle in this plan: WhatLanguage uses symbols for everything.  That’s fine and in line with Ruby idioms, and technically we could use Scala symbols if we really wanted to, but Scala is a statically typed language with a lot of powerful idioms of its own.  It would be a lot more conventional if we found a type-safe way of representing the same concepts.  To that end, I decided to go with a similar (but not identical) approach as exemplified in the following code snippet:

import com.codecommit.lang._
 
"Hello, this is a test".language       // => english
 
val wl = new WhatLanguage(english, french, spanish)
val text = "Bonjour, my name is Daniel. Estoy bien. Como estas? Êtes-vous ennuyer?"
 
wl.processText(text)   // => Map(English -> 4, French -> 5, Spanish -> 6)

The neat trick of the day is our use of static values in scope as a form of type-safe symbol.  We create a new instance of WhatLanguage, passing it three instances of class Language representing (oddly enough) the languages we wish it to use during the analysis.  We can also use “all” as a shortcut for enumerating every supported language.

Another cute little API feature is our use of the lang “package” to bring everything into scope, including two implicit conversions and a small boat-load of language values.  This is made possible by the fact that lang is not actually a package but a singleton object.  Even Scala does not allow values and functions as top-level elements in a package, but it does allow them within objects.  Our import at the head of the snippet is actually equivalent to the following bit of Java:

import static com.codecommit.lang.*;

With all of this in mind, the rest of the library is fairly easy to create.  All we need to do is design an algorithm which splits an input string into tokens, loops over each one keeping track of how many tokens are matched by each language, and finally selects the highest-scoring language from the result and returns its corresponding static value.  As one would expect, this is easily accomplished:

package com.codecommit
 
import java.io.BufferedInputStream
import com.codecommit.collection.BloomSet
 
import scala.collection.mutable.ArrayBuffer
 
object lang {
  case object all extends Language("all") {
    val langs = List(dutch, english, farsi, french, german, pinyin, 
                     portuguese, spanish, swedish)
 
    override lazy val words = BloomSet[String]()
 
    override val toString = "All languages meta-variable"
  }
 
  val dutch = Language("dutch")
  val english = Language("english")
  val farsi = Language("farsi")
  val french = Language("french")
  val german = Language("german")
  val pinyin = Language("pinyin")
  val portuguese = Language("portuguese")
  val russian = Language("russian")
  val spanish = Language("spanish")
  val swedish = Language("swedish")
 
  implicit def conversion(str: String) = new {
    val language = new WhatLanguage(all).language(str).getOrElse(null)
  }
 
  implicit def languageToString(lang: Language): String = lang.toString
 
  sealed case class Language private[lang] (file: String) {
    lazy val words = {
      val is = new BufferedInputStream(getClass.getResourceAsStream("/lang/" + file))
 
      try {
        BloomSet.load[String](is)
      } finally is.close()
    }
 
    override val toString = file(0).toUpperCase + file.substring(1)
 
    override def equals(other: Any) = other match {
      case lang: Language => toString == lang.toString
      case _ => false
    }
 
    override val hashCode = file.hashCode
  }
 
  class WhatLanguage(langs: Language*) {
 
    def language(str: String) = {
      val back = new ArrayBuffer[Language]
      var max = 0
 
      for ((lang, score) <- processText(str)) {
        if (score > max) {
          back.clear
 
          back += lang
          max = score
        } else if (score == max) {
          back += lang
        }
      }
 
      if (back.length == 1) Some(back(0)) else None
    }
 
    def processText(str: String) = {
      val langs = if (this.langs.contains(all)) all.langs else this.langs
      val prime = langs.foldLeft(Map[Language, Int]()) { _(_) = 0 }
 
      str.split("""\s+""").foldLeft(prime) { (map, token) =>
        langs.foldLeft(map) { (map, lang) =>
          if (lang.words.contains(token.toLowerCase)) {
            map(lang) = map(lang) + 1 
          } else map
        }
      }
    }
  }
}

Conclusion

Well, it’s been a fun and informative journey through the fertile land of bloom sets and simple statistical language analysis.  As usual, all of the source (and binaries) are available for download.  It’s a bit of a shame, but I didn’t get a chance to discuss a number of rather interesting features also available in my bloom filter implementation (such as concatenation of bloom filters).

Bloom filters are quite interesting in and of themselves, and definitely a useful technique that every developer should keep in their back pocket.  Peter Cooper definitely merits kudos for his implementation of WhatLanguage based on bloom filters, but despite the fact that I took the time to port it to Scala, I still don’t see much practical benefit.  At any rate, the library is now available in Scala; hopefully someone else will find it more useful than I do!

Comments

  1. The problem with JRuby probably wasn’t anything to do with the performance of Ruby or the computational characteristics of the system and everything to do with a slow Java integration layer. Up until JRuby 1.1.4, most calls from Ruby to Java were dramatically slow. And by dramatically, I mean in the neighborhood of 10-20x slower than they are in 1.1.4. Even in 1.1.4, they come out 2-5x slower than Ruby-to-Ruby calls. This perf hit is largely due to the cost of marshalling different types of objects, though before 1.1.4 it was due to a large part of the Java integration layer being itself written in suboptimal Ruby code. Also, because Ruby’s strings are byte[] where Java’s strings are char[], currently JRuby must decode and encode all strings passing across that boundary. Since your script likely opened up a file, read strings, and dumped them to your fast Java or Scala-based library, you were suffering a death of a million cuts as each bit of string was repeatedly marshalled and unmarshalled. These are known issues with JRuby, and largely we recommend that computationally-sensitive code toss large units of work over the fence, rather than make thousands of tiny expensive calls. For a long time the priority has been Ruby first, Java second; but with Ruby looking very solid now, we are turning our attentions more and more toward making JRuby work better with Java’s libraries and type system.

    Charles Nutter Monday, October 13, 2008 at 2:48 am
  2. > Java (and by extension, Scala) does not really give us raw access to bit addresses, but the JVM is still able to optimize boolean arrays into very efficient bit-based representations.

    This is incorrect. A boolean (even in an array) takes up at least one byte. This is to guarantee the memory model semantics (CPUs can’t do an atomic put to a single bit). For efficient storage of bits you should use the BitSet class. That could make the naive approach possible. It might be possible to write specialization of your Vector which could be much more efficient for this case, but I can’t immediately see how because of the use of heterogenous arrays.

    Also I suspect that WhatLanguage is probably not using the best way to determine language. For many languages a character set check will do; for latin languages a character histogram might work.

    Asd Monday, October 13, 2008 at 3:00 am
  3. Also your iterative hash function is terrible:

    scala> hash("banana", 0, 100)
    res13: Int = 27

    scala> hash("banana", 1, 100)
    res14: Int = 26

    scala> hash("banana", 2, 100)
    res15: Int = 24

    scala> hash("banana", 3, 100)
    res16: Int = 27

    Asd Monday, October 13, 2008 at 3:15 am
  4. I’ve been looking into these for a quick method of finding which chromosome a sequence of DNA hits :) I have seen on my travels another way of generating a list of hashes that I thought was quite neat using the Random class (in java)

    Basically, taking the Random object with the seed set to the hashCode of the object to be encoded, and then calling nextInt( width ) as many times as you want hashes (I tried to post this before, but can’t work out how to post code as a comment) ;)

    There was some research into detecting languages in 2002 as well, and the method they came out with was using Huffman encoding and then inspecting the size of the dictionary. Their paper was published here in the Physical Review Letters Journal:

    http://prola.aps.org/abstract/PRL/v88/i4/e048702

    And the NYT writeup is here:

    http://query.nytimes.com/gst/fullpage.html?res=9F0CE6D71F3EF933A05757C0A9649C8B63

    Tim Yates Monday, October 13, 2008 at 3:17 am
  5. Another small point :)

    Math.abs(x) % b is not safe to use.

    scala> Math.abs(Integer.MIN_VALUE) % 100
    res28: Int = -48

    Asd Monday, October 13, 2008 at 3:21 am
  6. Why fix the size of the bloomfilter? You get much more efficient structures if you scale k and m, and a much neater API; you just tell it how many items you have and what false positive rate you can put up with, and get back m and k. You also don’t get insanity like “Optimal K: 3470”. This was in C, but I converted it to Ruby for you:

    # –
    def size_filter(capacity, error = 1 / 1000000.0)
      m = ((capacity * Math.log(error)) / Math.log(1.0 / (2.0 ** Math.log(2.0)))).ceil
      k = (Math.log(2.0) * m / capacity).ceil
      {:k => k, :m => m}
    end

    p size_filter(234936, 0.03) # => {:m=>1714667, :k=>6}
    p size_filter(399, 0.03) # => {:m=>2913, :k=>6}
    # –

    Also, after finding how poorly “fast” hashes perform on bloomfilters, I ended up just using SHA512; you get excellent distribution, and can make a lot of k’s out of all those bits.

    Thomas Hurst Monday, October 13, 2008 at 5:17 am
  7. Just want to second Asd’s first point – you are wasting 7 bits for every bit you store by using a Vector, especially unfortunate given that Bloom Filters are supposed to save space. Use a BitSet.

    Ian Clarke Monday, October 13, 2008 at 6:50 am
  8. It’s not the factorial, it’s n+(n-1)+(n-2)+… = n(n-1)/2.

    If it were the factorial, even 30 hash functions would be waaaaay too much.

    Paolo Bonzini Monday, October 13, 2008 at 7:51 am
  9. @Charles

    Your note on performance actually makes a lot of sense. I was running this with JRuby 1.1.2. For the record, I wasn’t really expecting (or demanding) performance approaching Scala’s in JRuby, I was just somewhat surprised to see how divergent it really was.

    The String conversion was a beast as well. I actually had to force the conversion by explicitly creating a new instance of java.lang.String, which makes things all the more worse.

    @Asd and @Ian

    I’m a little surprised that the JVM doesn’t optimize boolean[]. Considering all the other complicated optimizations performed by the JIT, this would seem to me to be low hanging fruit.

    In any case though, you’re right that the JVM isn’t optimizing BloomSet down to one bit per element. Even if it could optimize boolean[], Vector isn’t using a typed array under the surface. It’s actually even worse than a straight array of primitive booleans due to the fact that every boolean has to be stored in its boxed form. Watching the memory usage of an example application using WhatLanguage shows that life isn’t *too* terrible, but it’s still a lot more than I would have liked.

    Thinking about it, the best way to solve this would be (as you say) to create a new persistent data structure based on the Vector concept which stores bits in Integer masks. Conveniently, this would still preserve all of the exact properties of Clojure’s persistent vector since integers contain 32 bits. This sort of data structure would not be possible just as a “specialization” of Vector, but the same ideas could be cross-applied through (mostly) copy-paste coding.

    Java’s BitSet isn’t really an option because it is mutable. I really strongly believe in using immutable data structures except where they carry a severe performance hit. Immutable data structures can be used mutably if you really need to, but they also carry all of the benefits of air-tight thread safety and dirt cheap versioning (not to mention correctness assurance in cases where a return value may expose implementation details).

    @Asd (again)

    It would seem that you’re right about hash. I guess that’s what happens when I try to re-tune algorithms at the last minute… :-S Originally, I had a very different version of hash based on Random (@Tim). As I was actually writing the article, I attempted to pare down the implementation just a bit so as to improve the readability. Seems I went just a little too far…

    Considering that there is exactly one integer value (out of 2^32 – 1) that reproduces this issue, I think I’m pretty safe. :-) Do you know of another function which does better? I suppose I could just right-shift everything by one place, which yields the more natural top-end result: Integer.MIN_VALUE >> 1 == Integer.MAX_VALUE.

    @Tim

    Character repetition frequency is an interesting way of detecting language. This would be a lot more memory-efficient than assembling bloom filters. However, I’m not entirely sure that it would be as accurate. The nice thing about the dictionary approach is fairly easy to analyze its properties and discover why it is wrong in such cases that it is. This contrasts pretty heavily with a more statistical approach, which would be very difficult to debug. I’m sure there are pathological English sentences that look a lot like French, German or even Dutch (considering the similarities in the languages).

    @Thomas

    I thought about letting the width float, but apparently you’re more ambitious than I am in terms of calculating optimizations. :-) The BloomSet API is capable of accepting a configurable width, so this would only require changes in the GenLangs.scala script. Thanks for the pointer!

    @Paolo

    When I initially looked at the algorithm, it seemed closer to O(n^2), but somehow my second glance told me it was O(n!). I’ll correct the article. Thanks for catching my error!

    Daniel Spiewak Monday, October 13, 2008 at 10:06 am
  10. As a followup, I experimented a bit with the floating width. These are the results from before its implementation:

    wl.processText(“This is a very long string which contains a lot of different words. The goal is to try to confuse the language processor and draw out some more contrasting results in terms of language scoring”)

    => Map(Portuguese -> 10, English -> 31, Swedish -> 5, Farsi -> 3, German -> 1, Spanish -> 7, French -> 13, Dutch -> 15, Pinyin -> 1)

    And afterward:

    => Map(Portuguese -> 9, English -> 30, Swedish -> 7, Farsi -> 2, German -> 1, Spanish -> 3, French -> 13, Dutch -> 17, Pinyin -> 1)

    So oddly enough, it seems that the floating width has a slightly deleterious effect on accuracy, at least with this particular string. Overall accuracy should have been improved, particularly with strings containing Russian words. The only unfortunate consequence is the memory requirements have increased due to the average width being higher than 2,000,000. Still, an interesting experiment to be sure.

    Daniel Spiewak Monday, October 13, 2008 at 10:56 am
  11. @Daniel, using an immutable datastructure for a BloomFilter is a good example of where blind loyalty to immutability is a bad idea :-)

    I’m not terribly familiar with how a bitset would be implemented in an immutable manner, but my guess is that you would throw away much of the performance benefit of using a BloomFilter.

    Immutability is beneficial in many, but not every circumstance.

    Ian Clarke Monday, October 13, 2008 at 12:53 pm
  12. @Ian

    Well, not *quite* blind loyalty. :-) I really, really like the benefits that immutable data structures carry with them. I mean, think about it, how many times have you done something like this by accident?

    public class Bean {
        private String[] data;

        private String[] getData() {
            return data;
        }
    }

    There’s no setter for data, but it doesn’t really matter; anyone can call getData() and then change the internals. It’s even worse with more complex data structures like List and Map. And don’t even get me started on thread synchronization issues… :-)

    In terms of performance, immutable data structures can be pretty darn fast. My benchmarking of Rich Hickey’s persistent vector shows it to be within 5x-10x of the speed of an ArrayList on “writes” (remember, a new Vector each time), and between 1x and 2x the speed on reads. It’s hard to argue that these factors constitute a significant performance hit, especially when we remember that lookups in an ArrayList are on the order of 0.5-1ms.

    It should be possible to implement a bitset immutably using much the same technique as is used in the persistent vector. However, rather than the lowest level of the trie being composed of a 32-element array, this level would be a single 32-bit integer. Because of the speed of integer copying on modern processors (constant), this sort of data structure would actually be phenomenally faster than a general-purpose Vector. Granted, this is to be expected since it is limited to bit values, but it’s still a cool conclusion to trot out. :-)

    With this sort of implementation, I don’t think you would really have to discard any of the benefits of a bloom filter. Granted, it would certainly have a higher “write” overhead than a mutable bloom filter, but its reads would be very nearly (if not exactly) as fast. The “write” overhead wouldn’t be too much of a concern because bloom filters are usually used immutably anyway. In short, I think the benefits of a fully immutable bloom filter would far outweigh the minimal performance penalty imposed by such a structure.

    P.S. Oh, I’m not really claiming that immutable data structures are computationally as efficient as their mutable counterparts. Even Clojure’s persistent vector is really only log32(n) for reads, which is close to constant time but still not quite. However, in most scenarios that I work on, immutable data structures lend considerably to the correctness and logical “reasonability” of the code. Add in their inherently stellar throughput when dealing with concurrency, and I think the overall benefits are more than sufficiently compelling (at least, they are to me).

    Daniel Spiewak Monday, October 13, 2008 at 1:21 pm
  13. Awesome writeup! Very interesting. Glad you liked the WhatLanguage work. It turns out that there are other statistical ways to analyze and determine that language that might be significantly more efficient (not just in speed but memory too) – many of these are linked in the WhatLanguage post on Ruby Inside.

    That aside, this is a great introduction to the topic and it was interesting to see the Scala code. I own scalainside.com, but unfortunately never got into the language enough to go full steam into it.

    Peter Cooper Monday, October 13, 2008 at 7:01 pm
  14. I made this same point on the original WhatLanguage blog post, but Bloom Filters are not the way to do this. Anything that operates on the word level will have trouble with morphological artifacts, e.g., pluralization, verb tenses, etc.

    It gets much more complicates in agglutinative languages like Hungarian and Finnish, and even languages like German which can basically synthesize arbitrary words and still remain intelligible German.

    A better way to determine a language is to use an n-gram model. The distribution of n-grams across a language is a great finger print and you can identify most languages within a word or two.

    Cheers,
    Jesse

    Jesse Farmer Monday, October 13, 2008 at 9:50 pm
  15. We solved the “what language is this phrase” using a naive Bayesian classifier. Works great; takes not too long to train if you pick the training set well, and uses much, much less memory.

    Eric Bowman Sunday, December 6, 2009 at 2:07 am
  16. > There’s really no need to put them in the main class, and the JVM is able to optimize dispatch to static members to a slightly higher degree than it can instance members (invokestatic).

    Putting methods in the companion object does not save you any virtual method calls. I think that scalac generates an extra static method for better Java interop, but the code in the article won’t be using any static method calls.

    Craig P. Motlin Monday, January 24, 2011 at 8:17 am
