Skip to content

Scala as a Scripting Language?

3
Nov
2008

I know, the title seems a bit…bizarre.  I don’t know about you, but when I think of Scala, I think of many of the same uses to which I apply Java.   Scala is firmly entrenched in my mind as a static, mid-level language highly applicable to things like large-scale applications and non-trivial architectures, but less so for tasks like file processing and system maintenance.   However, as I have been discovering, Scala is also extremely well suited to basic scripting tasks, things that I would normally solve in a dynamic language like Ruby.

One particular task which I came across quite recently was the parsing of language files into bloom filters, which were then stored on disk.  To me, this sounds like a perfect application for a scripting language.  It’s fairly simple, self-contained, involves a moderate degree of file processing, and should be designed, coded and then discarded as quickly as possible.   Dynamic languages have a tendency to produce working designs much faster than static ones, and given the fact that the use-case required access to a library written in Scala, JRuby seemed like the obvious choice (Groovy would have been a fine choice as well, but I’m more familiar with Ruby).  The result looked something like this:

require 'scala'
 
import com.codecommit.collection.BloomSet
 
import java.io.BufferedOutputStream
import java.io.FileOutputStream
 
WIDTH = 2000000
 
def compute_k(lines, width)
  # ...
end
 
def compute_m(lines)
  #...
end
 
Dir.foreach 'wordlists' do |fname|
  unless File.directory? fname
    count = 0
    File.new "wordlists/#{fname}" do |file|
      file.each { |line| count += 1 }
    end
 
    optimal_m = compute_m(count)
    optimal_k = compute_k(count, WIDTH)
 
    set = BloomSet.new(optimal_m, optimal_k)
 
    File.new "wordlists/#{fname}" do |fname|
      file.each do |line|
        set += line.strip
      end
    end
 
    os = BufferedOutputStream.new FileOutputStream.new("gen/#{fname}")
    set.store os
    os.close
  end
end

As far as scripts go, this one isn’t too bad.  I’ve written some real whoppers for things like video encoding and incremental backups.  The main trick here is the fact that we need to make two separate passes over the same file in order to get the number of lines before constructing the set.  We could load the file into an array buffer in a single pass, count its length and then iterate over the array, placing each element in the bloom filter.   However, this really wouldn’t be too much faster than just hitting the file twice (we still need two separate passes) and it has the additional drawback of requiring a fair amount of memory.

All in all, this script is a fairly natural representation of my requirements.   I needed to loop over a number of word lists, push the results into separate bloom filters and then freeze-dry the state.  However, look at what we’ve actually done here.  Remember earlier where we were considering which language to use?  We wanted a language which could concisely and quickly express our intent.  For that decision making process, we just assumed that a dynamic language would suffice better than one hampered by a static type system.  However, at no point in the above script do we actually do anything truely dynamic.  By that I mean: open classes, unfixed parameter types, method_missing, that sort of thing.   In fact, we haven’t really done anything that we couldn’t do in Scala:

import com.codecommit.collection.BloomSet
import java.io.{BufferedOutputStream, File, FileOutputStream}
import scala.io.Source
 
val WIDTH = 2000000
 
def computeK(lines: Int, width: Int) = // ...
 
def computeM(lines: Double) = // ...
 
for (file <- new File("wordlists").listFiles) {
  if (!file.isDirectory) {
    val src = Source.fromFile(file)
    val count = src.getLines.foldLeft(0) { (i, line) => i + 1 }
 
    val optimalM = computeM(count)
    val optimalK = computeK(count, optimalM)
 
    val init = new BloomSet[String](optimalM, optimalK)
 
    val set = src.reset.getLines.foldLeft(init) { _ + _.trim }
 
    val os = new BufferedOutputStream(new FileOutputStream("gen/" + file.getName))
    set.store(os)
    os.close()
  }
}

This is actually runnable Scala.  I’m not omitting boiler-plate or cheating in any similar respect.  If you copy this code into a .scala file and make sure that BloomSet is on your CLASSPATH (which you would have needed anyway for JRuby), you would be able to run the script uncompiled using the scala command.  Unlike Java, Scala actually includes an “interpreter” which can parse raw Scala sources and execute the representative program just as if it had been pre-compiled using scalac.   One of the perquisites of this approach is the ability to simply omit any main method or Application class.  In nearly every sense of the word, Scala is a scripting language…as well as an enterprise-ready Java-killer (let the flames begin).

Now that we’re fairly convinced that the above is valid Scala, let’s compare it with the original version of the script written using JRuby.  If we just go off LoC (Lines of Code), Scala actually wins here.  This was a more-than-slightly surprising discovery for me, given how often dynamic languages (and Ruby in particular) are touted as being more concise and expressive than static languages.  But of course, sheer LoC-brevity isn’t everything: we also should consider things like readability.  A few characters of Befunge can accomplish more than I can do in several lines of Scala, but that doesn’t mean I’ll be able to figure out what it means tomorrow morning.

On the readability score, I think Scala wins here too.  The file processing and set creation is all done in a highly functional style (using foldLeft).   At least to my eyes, this is a lot easier to follow than the imperative form in Ruby.  More importantly, I think it’s a bit harder to make silly mistakes.  When I wrote the Ruby version of the script, it took several tries before I solidly pinned down the exact incantation I was seeking.  The Scala version literally required only one revision after the initial prototype.   Granted, I had the Ruby version to go off of, but I think we would all agree that the scripts use some fairly different libraries and methodologies for accomplishing identical tasks.

So what is it that makes Scala so surprisingly well suited to the task of quick-and-dirty file processing and scripting?  After all, isn’t is just a fancy syntax wrapping around the plain-old-Java standard library?  While it is true that Scala has first-class access to Java libraries (as demonstrated in the script), that isn’t all that it offers.  I believe that Scala has two important features which make it so suitable for these tasks:

  • Type inference
  • Powerful core libraries

The first feature is of course evident wherever you look in the script.  With the exception of the two methods and the BloomSet constructor, we never actually declare a type anywhere in the script.  This gives the whole thing a very “dynamic feel” without actually sacrificing static type safety.   The first time you try this sort of language feature it is an almost euphoric experience (especially coming from highly-verbose languages like Java).

The second feature is a bit harder to see.  It is most evident in the way in which we handle file IO.  The directory listing is of course yet another application of the venerable java.io.File class, but the process of opening and reading the file line-by-line seems to be a lot easier than anything Java can muster.  This is made possible by Scala’s Source API.   Rather than fiddling with BufferedReader and the whole menagerie that goes along with it, we just get a new Source from a File instance and then use conventional Scala methods to iterate over its contents.  In fact, we’re actually applying a functional idiom (fold) rather than a standard imperative iteration.  Finally, when we’re done with our first pass, we don’t need to re-open the file from scratch (inviting initialization mistakes in our coding), we just reset the Source and start from the beginning once more.

Using Scala as a scripting language comes with some pretty hefty benefits.   For one thing, you get immediate and idiomatic access to the mighty wealth of libraries which exist in Java.  Even for scripting, this sort of interoperability is invaluable.  JRuby does provide some excellent Java interop, but it simply can’t compare to what you get with Scala.  Further, Scala has a static type system to check you (at runtime with a script) to ensure that you haven’t done anything obviously bone-headed.  This too is nothing to sniff at.

Given the fact that Scala’s “scripting syntax” is just as concise as Ruby’s (sometimes more), it’s hard to see a reason not to employ it for around-the-server tasks.  Amusingly, the most compelling reason not to use Scala for scripting just might be its comment syntax.  Not having direct support for the magic “hash bang” (#!) incantation to define a file interpreter just means that Scala scripts have to go through some extra steps to be directly executable.  However, if immediately-executable scripts aren’t an issue, you may want to consider Scala as your scripting language of choice for your next non-trivial outing.  You may reap the rewards in ways you weren’t even expecting.

Bloom Filters in Scala (and all the fun that they bring)

13
Oct
2008

This is a post I have been meaning to write for quite some time.  Despite being an officially excommunicated former-member of the Ruby community, I still like to keep up on the latest developments.  A few months ago, my Ruby Inside feed produced an interesting post introducing a new library by Peter Cooper (the author of Ruby Inside).  Unlike the most “new” Ruby libraries which are most often just bad clones of Rails snibblets, Peter’s WhatLanguage seemed to be quite unique and interesting.  For the first time in years, I was reading about an innovation in Ruby-land which actually sounded quite exciting.  A good feeling to be sure.

In a nutshell, WhatLanguage (project page at GitHub) is a statistical language identification API which analyses arbitrary input String(s), scoring them against a set of built-in languages (including English, French, Spanish, Russian, and others).  Add to that a little bit of meta-programming magic and the possibilities are eye-catching to say the least:

require 'whatlanguage'
 
'Here is some text to identify'.language       # => :english
'¿Hola, pero donde está su pelo?'.language     # => :spanish

To be honest, I’m not sure what practical application I would have for such a library, but it sure looks like fun!  :-)  Under the surface, the implementation is deceptively simple.  WhatLanguage bundles a set of words for each language it supports.  The implementation then searches these sets for each word in the string, aggregating a score for each language depending on whether or not the word was found.  If exactly one language has a higher score than the others once the end of the string is reached, it is “declared the winner” and returned as a result.  Otherwise, the analysis is undecided and nil is returned.  Simple, right?

The tricky part is somehow storing and searching those gigantic language word sets in an efficient manner.  As it turns out, this isn’t so easy.  Just stop and consider the number of words which exist in your native tongue.  Unless you speak a very strange, primarily-tonal language, the number is likely in the upper hundreds of thousands.  Just going off of Peter’s word lists, Russian is the real trouble maker, weighing in with a nearly one million word vocabulary.  Any attempt to store that much data would require megabytes of memory and several minutes to process, even on a fast machine.  Any sort of hash set is completely out of the question due to the nature of contiguous memory allocation, but can you imagine trying to perform a search traversal on a tree with almost a million nodes?  Obviously, a more effective solution is required.

Probabilistic Sets

So far we have been focusing our imaginary efforts on data structures which guarantee 100% accuracy and full reversibility.  That is to say, data structures which allow you to get out what you put in and which ensure that if something is “in” it will be found and if something is “out” it will not.  After all, is a data structure without these guarantees really all that useful?

As it turns out, the answer is “yes”.  Consider a standard hash set as an example.  Sets make absolutely no guarantee about the ordering of their elements.  If you attempt to iterate over a set, you will eventually reach everything inside, but you cannot rely on a certain element coming before the others.  Of course, the advantage we gain by making this trade-off is constant time searching by item identity.  Lists cannot achieve this performance, neither can arrays.  If you want to search either one of these structures, you must look through every element from start to finish.  So with hash sets, we have effectively traded a fundamental guarantee about our data for efficiency in some other respect.

The same is true for probabilistic sets.  We don’t actually need all of the properties of a set to implement WhatLanguage.  For example, we don’t need to actually retrieve a word from our dictionary, we only need to discover whether or not that word is in the set.  In fact, we really only need our data structure to have two important properties:

  • Lightning fast boolean searches (does the set contain this element?)
  • Minimal memory footprint

Hash sets can make the first guarantee, but as we have already noticed, they don’t really require any less memory than just storing everything in an array.  Since we can’t afford one million words-worth of memory per language, we need to come up with some way of storing information about the presence of an element without retaining the data it represents.  In order to do this, we have to make one important sacrifice: accuracy.

We basically need to accept the fact that our data structure may sometimes return inaccurate results.  As it turns out though, this is just fine.  Remember, we’re scoring languages one token at a time; an incorrect answer now and again isn’t going to kill us.  Given a large enough input string, a minimal percentage of inaccuracy will literally be lost in the noise.

As it turns out, there already exists a very old technique for accomplishing exactly these requirements.  Bloom filters have been around since the beginning of time (or at least the 1980s), but unfortunately very few developers are aware of them.  In a nutshell, they precisely satisfy all of the needs we just finished enumerating.  They provide true (not amortized) constant-time searches and will consistently return true or false for a given element, depending on whether or not the value is in the set.  Occasionally they will give false positives (claiming that an element is in the set when in fact it is not), but never false negatives.  This property is extremely useful in many applications, but it turns out that this is even more than we need for WhatLanguage.

Most importantly though, bloom filters are extremely compact in memory.  Rather than storing an entire value, they hash every element to one or (usually) more indexes in an array of bits (0 or 1).  In an optimal bloom filter guaranteeing no false positives, each element would correspond to precisely k bits of memory (where k is the number of hash functions, usually in the vicinity of 3 or 4).  Once this hashing is complete, the value itself may be discarded.  Thus, assuming an optimal bloom filter with 3 hash functions, the storage requirements for one million words would only be 375 KB of memory.  Compare this to the 8 MB required just to store the pointers to each of the words, much less the values themselves!  If we include the underlying character values, we start pushing closer to 20 MB.

375 KB for a million elements is impressive, but bloom filters can actually do a lot better than that.  Remember, we agreed to accept some false positives.  It’s easy to imagine how even an ideal bloom filter (such as the one we have been analyzing) could have some inaccuracy, but if we’re willing to live with a few more false positives, we can allow hash collisions and drop our memory requirements into the 200 KB range.  Not bad, considering how much data we’re storing!

Bloom Filters

So how does the magic actually work?  We’ve hinted at the implementation in our analysis of memory requirements, but we’re still missing some important details.  Consider the following diagram (courtesy of Wikipedia):

Under the surface, a bloom filter is just an array of boolean values.  The length of the array is referred to as the width of the bloom filter.  When the filter is empty, all of the array values are set to false.  To insert an element, we compute k different hash indexes in the array and set them all to true.  The real beauty of the data structure is the manner in which we handle collisions: we don’t worry about it.  Whenever one (or more) of the hash functions returns an index which is already true, we simply leave that index as-is and move on.  Thus, the procedure for inserting an element into a “full” bloom filter is precisely the same as inserting into an empty one: set all of the hash indexes to true.  In a sense, bloom filters can never be full, they just get less and less accurate.

Bloom filters define only one operation other than insertion: searching.  To determine whether or not a value is in the set, we run all of our hash functions and find the indexes to which it corresponds.  If all of these indexes are true, then we can assume that the value is indeed in the set.  Conversely, if any one of indexes is false, then the value cannot be in the set since we would have set that value to true upon insertion.

It is important to note that removal is not supported by bloom filters.  Intuitively we could just take the value to be removed, calculate its hash indexes and then set the corresponding elements to false.  However, if any other values also hash to even one of those locations, we will be effectively removing it from the bloom filter “by accident”.  Subsequent searches for that value will hash to the index which we erroneously set to false, resulting in a false negative on the search (saying that an element is not in the set when in fact it is).  Fortunately for us, WhatLanguage only requires insertion and searching for values, removal is not on the agenda.

Scala Implementation

Well, this has probably been a nostalgic (for those of you who enjoyed your university days) stroll through the land of data structure theory, but we’re still lacking one critical factor: a concrete implementation.  Bloom filters don’t sound that complicated on paper, but to do anything useful with them we’re going to need some code.

Before we dive in, we might want to define some more precise requirements for our implementation.  We’re writing this in Scala, so it would be nice to implement bloom filters as a functional data structure.  I’m not going to get into all the benefits of immutable data structures; suffice it to say, I really think that they’re the only way to go for most serious applications.  Unfortunately, in order to do this, we’re going to need some sort of immutable, array-like data structure under the surface which provides random access in constant time.  Sound familiar?

This is actually where the whole saga of persistent vectors in Scala found its roots.  The first time I tried to implement a functional bloom filter in Scala I made use of a conventional array under the surface with copy-on-write semantics.  As you can imagine, this is hideously inefficient in both space and time.  In order to solve the problem, I had to implement a data structure which was just as efficient for reading and writing, but which also shared structure with subsequent versions.  Now, two months later, I’ve borrowed just such a design from Clojure and created a first-class implementation in Scala.

With a persistent vector now at our disposal, we can easily implement a persistent bloom filter by using a vector to represent our boolean indexes.  Every time we perform an insertion, we will “modify” the relevant indexes, creating a new instance of the vector.  This vector can then be used in our “modified” bloom filter, returned as a new instance of our container class.  Since persistent vectors share unmodified structure between versions, so does our bloom filter.  All of the hard work of efficient immutability is already done for us, we just need to implement a thin facade.  We start out by defining the class skeleton:

import BloomSet._
 
class BloomSet[A] private (val size: Int, val k: Int, 
                           private val contents: Vector[Boolean]) extends ((A)=>Boolean) {
  val width = contents.length
 
  def this(width: Int, k: Int) = this(0, k, alloc(width))
}
 
object BloomSet {
  def alloc(size: Int) = {
    (0 until size).foldLeft(Vector[Boolean]()) { (c, i) => c + false }
  }
}

We’ll keep all of our utility functions in the companion object.  There’s really no need to put them in the main class, and the JVM is able to optimize dispatch to static members to a slightly higher degree than it can instance members (invokestatic).  Technically, we don’t really need the alloc function, it just hides some of the boiler-plate involved in initializing a vector to a certain length.  Note that this operation is extremely fast, even for high values of width.  Despite the fact that we are creating a huge number of Vector objects, very little memory is actually wasted.  Even at this early stage in the implementation, the advantages of persistence are evident.

A Little Math

You’ll notice that our bloom filter has three major properties: size, k, and width.  As previously mentioned, k is the number of hash functions to be used, while width is the length of the array used to store the boolean values.  These values have to be user-configurable because different applications will call for different optimizations.  For example, implementing WhatLanguage requires the storage of almost a million elements per set.  Obviously we would not want to attempt to use a bloom filter which only had a width of (say) 200.  This wouldn’t cause any errors per se (remember, bloom filters are never “full”), but the accuracy of such a filter would be almost nil.  With a million elements all hashed into a mere 200 indexes, it is likely that that every single index would have a value of true.  Such a bloom filter would return true when searching for any value, including those which were not inserted.  We said we would tolerate some false positives, but we still want the results to be meaningful.  Since the width of a bloom filter cannot be changed once data is inserted (at least, not without “losing” data), the user must be able to decide in advance how much space will be required and then instantiate BloomSet accordingly.

The value of k is an even more interesting proposition.  The intuition is that the larger our value of k (thus, the more hash functions we use), the greater the accuracy of our bloom filter and the fewer false positives we will return.  This is technically true when the bloom filter does not contain very many elements, but as more and more values are inserted, higher values of k become problematic.  Wikipedia has an excellent (and extremely readable) mathematical analysis of this and other properties of bloom filters.  As it turns out, the optimal value for k (one which will yield the fewest false positives) can be computed by the following expression (where m is the width and n is the size, or the number of elements contained in the filter):

k = \mbox{ln}(2) \; \frac{m}{n}

For the sake of quick-computation, this is essentially equivalent to:

k \approx \frac{9m}{13n}

The equivalence isn’t precise (logarithms are almost always irrational), but since we’re already rounding the result to the nearest integer, we don’t need to be too concerned.

While it would be nice if we could just auto-magically set k to the optimum value and be done with it, that just isn’t the case.  Like width, k cannot be changed after data has been inserted.  Unfortunately, its optimum value is dependent on the number of items which will eventually be in the set.  Since we don’t really “know” this value within the data structure until after the insertions are complete, there is no way for us to self-optimize.  To provide maximum configurability, we need to allow this value to be set by the API consumer on a case-by-case basis.

As an interesting aside, we can actually compute the accuracy of a bloom filter on a scale from 0 to 1 given values for width, k and size.  In this case, accuracy is defined as the odds that a search result is indeed accurate (the chance of it not being a false positive).  Wikipedia gives the following expression for this accuracy:

1 - (1 - e^{-kn / m})^k

Thanks to the marvels of approximation, this is a fairly efficient computation on modern platforms.  Translated into Scala, it looks something like this:

lazy val accuracy = {
  val exp = ((k:Double) * size) / width
  val probability = Math.pow(1 - Math.exp(-exp), k)
 
  1d - probability
}

Core Functionality

As interesting as this minor detour has been, we still haven’t arrived at a working implementation of the bloom filter data structure.  We have the basic structure in place, but we still need to implement the two core operations: insert and search, rendered in Scala as + and contains:

def +(e: A) = new BloomSet[A](size + 1, k, add(contents)(e))
 
def contains(e: A) = {
  (0 until k).foldLeft(true) { (acc, i) => 
    acc && contents(hash(e, i, width)) 
  }
}
 
protected def add(contents: Vector[Boolean])(e: Any) = {
  var back = contents
 
  for (i <- 0 until k) {
    back = back(hash(e, i, width)) = true
  }
 
  back
}

Feels sort of anti-climactic doesn’t it?  After all that analysis, we find that it is possible to implement the essence of a bloom filter in less than 17 lines of code.  Technically, I probably could trim this down to more like 8 lines, but this isn’t an exercise in Scala golf.  These three methods are just the materialization of all of the hand-wavy descriptions that we’ve spent the last five pages working through.  Each time we insert an element, we calculate k different hash values and set the appropriate vector indexes to true.  To find that element, we calculate the same k hashes and ensure that all of the indexes have an appropriate boolean value.

Hashing

Those of you who are actually reading the code snippets rather than just blindly trusting my veracity will probably notice that I’ve omitted a fairly important function from the above sample; specifically: hash.  Somehow, we need to define a function which will simulate the existence of a potentially unlimited number of hash functions.  Based on our usage of this function, we need to be able to grab a hash value for a given value and function index, restricted to a given width.  This is not a trivial problem.

As a good developer, you should already be trying to think up ways to reduce the code involved in implementing these requirements.  I mean, we could potentially just dream up a few dozen different hash functions with a fairly wide distribution of results.  However, I don’t know about you, but I really don’t have that much imagination.  I have a hard enough time developing just one solid hash function, much less dozens.  What’s more, if each hash function corresponds with a different method, then we can’t really scale the number of hash functions to any integer k.

A better approach would be to create a single hash function and then skew its value n times for every n between 0 and k.  Like most things that have to do with hashing, it’s cheating, but providing we do our job right, it should give us a fairly decent way of computing the required k hash functions.  Of course, the key phrase there is “do our job right”.  We need to make sure that we skew the initial hash value sufficiently for each successive hash, otherwise successive hash functions might get “stuck in a groove”.  For example, let’s assume that each successive hash function is equivalent to the previous hash + 2.  If we then inserted two elements into the bloom filter with k = 3, there would be the potential that the hash(0) for the first element would be precisely 2 greater than the hash(0) for the second.  If this is the case, then hash(1) for the second would equal hash(0) for the first.  Thus, the first and second elements would collide on two out of three indexes.  Not exactly an optimal distribution.

To avoid this problem, we should be sure to skew the hash value by a different amount for each hash function.  A convenient way of doing this is to use the number of iterations, or the index of the particular hash function.  To further improve the distribution, we will also use the bitwise XOR operation rather than simple addition.  It’s not really intuitively obvious, but XOR is mathematically far superior for hashing and just as cheap in terms of processor cycles.  Put it all together, and the implementation looks something like this:

def hash(e: Any, iters: Int, bounds: Int): Int = {
  Math.abs(
    if (iters == 0) e.hashCode 
    else iters ^ hash(e, iters - 1, bounds)
    ) % bounds
}

The final bit of math in this function mods our skewed hash value with the given bounds.  Whenever we call hash, we always pass the width of the bloom filter.  This is a trick that you’ll see a lot in any hash-based data structure.  Essentially, we’re just mapping our hash value into the required domain.  Hash values can be anything between -231 and 231 - 1, but our vector indexes can only be between 0 and width - 1.  Just a rule of thumb, but any time you have a problem where you must “map values from a large domain to a more restricted one”, mod is generally the correct approach.

Observant readers will notice that our implementation of hash is a little inefficient.  Actually, it’s worse than just a “little” inefficient: it’s downright terrible.  Remember that we have to call this method once for every integer n between 0 and k.  This isn’t so bad, since k is usually comparatively small, but the real problem is our sneaky use of recursion within hash.  The time complexity of hash is precisely O(n), where n is the number of iterations.  Once you factor this in with the number of times we call this function and it is easy to see how the total complexity of computing k hash values is actually O(k2).  Thus, for k = 4, we actually call hash 16 times!  You can imagine how this could get very problematic for higher values of k.

Technically, we could rewrite add and contains to compute the hash values incrementally from the bottom up.  This would bring the hashing complexity down to O(k).  However, this is a bit more elegant in terms of code, not to mention easier to talk about in a modular fashion.  In practice, this inefficiency is only a problem for higher values of k due to the fact that computing a single-step hash value is actually a very fast operation.  We will run into trouble in a little bit though when we attempt to optimize the bloom sets for shorter wordlists in WhatLanguage.

Serialization

One of the important requirements of WhatLanguage that we haven’t really touched on so far is the ability to store freeze-dried bloom filters on disk and then suck them back into memory quickly and efficiently.  If you think about it, this requirement only makes sense since one of our primary motivations from the start was to avoid the burdens associated with huge word lists.  After all, if our in-memory representation of a language is only a few hundred kilobytes, shouldn’t the on-disk representation be no worse?  Furthermore, if we had to parse a wordlist into a bloom filter every time we wanted to check a string’s language, the performance imposition would be utterly prohibitive.

Fortunately, there is a fairly obvious format which we could use to efficiently serialize an arbitrary bloom filter.  Recall that the real heart of a bloom filter is the array of boolean flags, an array which can also be represented as an array of bits.  We could very easily select these bits eight at a time, combine them together using bit-shifting and produce a byte value which could be sent to an OutputStream.  Thus, a bloom filter with an array of {true, false, true, true, true, false, false, true, false} would be stored as the following series of bits: 101110010, which in hexadecimal byte form becomes B9 00.

In addition to the array, we also need to store the size, k, and width of the bloom filter, otherwise it will be impossible to find anything reliably in the deserialized result.  The Scala store and load methods to accomplish this are verbose, but fairly straightforward:

class BloomSet[A] ... {
  ...
 
  def store(os: OutputStream) {
    os.write(convertToBytes(size))
    os.write(convertToBytes(k))
    os.write(convertToBytes(contents.length))
 
    var num = 0
    var card = 0
    for (b <- contents) {
      num = (num << 1) | (if (b) 1 else 0)    // construct mask
      card += 1
 
      if (card == 8) {
        os.write(num)
 
        num = 0
        card = 0
      }
    }
 
    if (card != 0) {
      os.write(num)
    }
  }
}
 
object BloomSet {
  def load[A](is: InputStream) = {
    val buf = new Array[Byte](4)
 
    is.read(buf)
    val size = convertToInt(buf)
 
    is.read(buf)
    val k = convertToInt(buf)
 
    is.read(buf)
    val width = convertToInt(buf)
 
    var contents = Vector[Boolean]()
    for (_ <- 0 until (width / 8)) {
      var num = is.read()
      var buf: List[Boolean] = Nil
 
      for (_ <- 0 until 8) {
        buf = ((num & 1) == 1) :: buf
        num >>= 1
      }
 
      contents = contents ++ buf
    }
 
    if (width % 8 != 0) {
      var buf: List[Boolean] = Nil
      var num = is.read()
 
      for (_ <- 0 until (width % 8)) {
        buf = ((num & 1) == 1) :: buf
        num >>= 1
      }
 
      contents = contents ++ buf
    }
 
    new BloomSet[A](size, k, contents)
  }
 
  ...
}

It is interesting to note that store is a good example of what David MacIver meant when he says that functional code image good code.  I actually tried to write this in a functional style to start with, but I gave up after it became horribly ugly (having four nested folds is never a good sign).  The imperative rendering (in this case) is concise and elegant, one of the many advantages of a hybrid languages like Scala.

This particular format for rendering a bloom set is almost precisely identical to the format used by Peter Cooper’s Bloomin’ Simple, the library which sits underneath WhatLanguage.  However, due to hashing differences it is impossible to use his serialized bloom filters in Scala or vice versa.

WhatLanguage in Scala

So now that we have our magic bloom filter ready and waiting, it’s time to take a crack at Peter’s fancy language identifier!  As it turns out, this is a lot simpler than the bloom filter itself.  Conceptually, all we have to do is load a different bloom filter for each language and then use that bloom filter to check each input string token individually.  For the sake of uniformity, we will convert all tokens to lower-case.

Of course, before we can load the language bloom filters we need to first store them, and we cannot store a bloom filter before it is generated.  To that end, we need to create a simple Scala script (yes, a script) which reads in a file of newline-delimited words, inserts them all into an instance of BloomSet and then stores the result in a corresponding language file.  Scala’s scripting functionality is actually quite good, especially for a statically typed language, and really deserves some better recognition.  The complete language generator script is as follows:

import com.codecommit.collection.BloomSet
import java.io.{BufferedOutputStream, File, FileOutputStream}
import scala.io.Source
 
val WIDTH = 2000000
 
def computeK(lines: Int) = (((9:Double) * WIDTH) / ((13:Double) * lines)).intValue
 
for (file <- new File("wordlists").listFiles) {
  if (!file.isDirectory) {
    println(file.getName)
    println("==========================")
 
    val src = Source.fromFile(file)
    val count = src.getLines.foldLeft(0) { (i, line) => i + 1 }
    println("  Word count: " + count)
 
    val optimalK = computeK(count)
    val init = new BloomSet[String](WIDTH, Math.min(optimalK, 100))
 
    println("  Optimal K: " + optimalK)
    println("  Actual K: " + init.k)
 
    val set = src.reset.getLines.foldLeft(init) { _ + _.trim }
 
    println("  Accuracy: " + set.accuracy)
 
    val os = new BufferedOutputStream(
        new FileOutputStream("src/main/resources/lang/" + file.getName))
    set.store(os)
    os.close()
 
    println()
  }
}

I originally created this script using JRuby, thinking of course that it would be much easier to perform a simple, one-off task like this in a dynamically typed language.  Interestingly enough, the JRuby version was about twice as long and actually took dramatically longer to run.  By “dramatically” I mean on the order of seventy times longer.  The above Scala script takes just over 30 seconds to run on my machine using the wordlists from WhatLanguage.  This stands in stark contrast to the 35 minutes required for the JRuby script.  Both scripts use the same underlying data structure to perform all the work (BloomSet), so it’s really hard to claim that one implementation was fundamentally slower than the other.  In short: JRuby does not seem terribly well-suited for computationally-intensive tasks, even when most of that computation is taking place in Java-land.  Anyway…

This script is fairly straightforward.  The one interesting landmark is the fact that it actually makes two separate passes over the wordlist file.  The first pass just counts the number of lines, while the second pass reads the words individually, converts them to lower-case and stores them in the bloom filter.  This two-pass approach allows us to calculate the optimal value for k in our bloom filter.  This contrasts with the Ruby implementation of WhatLanguage, which just uses 3 as a global default (the width is the same in both libraries).  According to my math, this per-language optimization actually ekes out an average 2-3% better accuracy in the bloom filters.  This means fewer false positives and nominally better language detection.

I say “nominally” because such a miniscule improvement in bloom filter performance actually makes very little difference in the grand scheme of things.  Most strings are decidedly one language or another, meaning that results should be essentially identical between the two implementations.  Regardless, it’s the thought that counts.  (right?)

Just for the curious, the script produces the following output when run:

  dutch
  ==========================
    Word count: 222908
    Optimal K: 6
    Actual K: 6
    Accuracy: 0.986554232499401

  english
  ==========================
    Word count: 234936
    Optimal K: 5
    Actual K: 5
    Accuracy: 0.9827068384240777

  farsi
  ==========================
    Word count: 339747
    Optimal K: 4
    Actual K: 4
    Accuracy: 0.9408664843803285

  french
  ==========================
    Word count: 629569
    Optimal K: 2
    Actual K: 2
    Accuracy: 0.7817441534106517

  german
  ==========================
    Word count: 298729
    Optimal K: 4
    Actual K: 4
    Accuracy: 0.9590696917792074

  pinyin
  ==========================
    Word count: 399
    Optimal K: 3470
    Actual K: 100
    Accuracy: 1.0

  portuguese
  ==========================
    Word count: 386393
    Optimal K: 3
    Actual K: 3
    Accuracy: 0.9148904670664727

  russian
  ==========================
    Word count: 951830
    Optimal K: 1
    Actual K: 1
    Accuracy: 0.6213162918878307

  spanish
  ==========================
    Word count: 595946
    Optimal K: 2
    Actual K: 2
    Accuracy: 0.7984358472107859

  swedish
  ==========================
    Word count: 54818
    Optimal K: 25
    Actual K: 25
    Accuracy: 0.9999999755910407

I’m not sure why Pinyin only has a vocabulary of 399 words, but that seems to be the way things are.  This means of course that we’re storing barely 400 elements in a bloom filter with a width of 2,000,000.  Needless to say, I wasn’t surprised to see that the optimal k was in the mid thousands.  Unfortunately, this is where our inefficient hash implementation comes back to bite us.  I discovered that if I were to allow a k value of 3470, even when only inserting 399 elements, the time required to process just the single language was upwards of 20 minutes.  This may have something to do with the fact that 34702 is very, very large.

To get around this problem, I cheat and cap the k value at 100 in the script.  This still produces a computed accuracy of 100% in the bloom filter, and it takes much less time to process.  A fast hash is actually even more important during lookups.  Considering that we have to check each token in a string against each and every supported language, a fast lookup is crucial to ensuring performance of the final product.  If it took even 10 ms to compute the hash of a single token when checking against the Pinyin language, the implementation would be completely unusable.

A Little API Design

Now that we have our word lists encoded as bloom filters, we can turn our attention to slightly more important problems; specifically: what do we want the API to look like?  I decided to base my Scala implementation primarily on Peter’s Ruby API.  However, there is one small wrinkle in this plan: WhatLanguage uses symbols for everything.  That’s fine and in line with Ruby idioms, and technically we could use Scala symbols if we really wanted to, but Scala is a statically typed language with a lot of powerful idioms of its own.  It would be a lot more conventional if we found a type-safe way of representing the same concepts.  To that end, I decided to go with a similar (but not identical) approach as exemplified in the following code snippet:

import com.codecommit.lang._
 
"Hello, this is a test".language       // => english
 
val wl = new WhatLanguage(english, french, spanish)
val text = "Bonjour, my name is Daniel. Estoy bien. Como estas? Êtes-vous ennuyer?"
 
wl.processText(text)   // => Map(English -> 4, French -> 5, Spanish -> 6)

The neat trick of the day is our use of static values in scope as a form of type-safe symbol.  We create a new instance of WhatLanguage, passing it three instances of class Language representing (oddly enough) the languages we wish it to use during the analysis.  We can also use “all” as a shortcut for enumerating every supported language.

Another cute little API feature is our use of the lang “package” to bring everything into scope, including two implicit conversions and a small boat-load of language values.  This is made possible by the fact that lang is not actually a package but a singleton object.  Even Scala does not allow values and functions as top-level elements in a package, but it does allow them within objects.  Our import at the head of the snippet is actually equivalent to the following bit of Java:

import static com.codecommit.lang.*;

With all of this in mind, the rest of the library is fairly easy to create.  All we need to do is design an algorithm which splits an input string into tokens, loops over each one keeping track of how many tokens are matched by each language, and finally selects the highest-scoring language from the result and returns its corresponding static value.  As one would expect, this is easily accomplished:

package com.codecommit
 
import java.io.BufferedInputStream
import com.codecommit.collection.BloomSet
 
import scala.collection.mutable.ArrayBuffer
 
object lang {
  case object all extends Language("all") {
    val langs = List(dutch, english, farsi, french, german, pinyin, 
                     portuguese, spanish, swedish)
 
    override lazy val words = BloomSet[String]()
 
    override val toString = "All languages meta-variable"
  }
 
  val dutch = Language("dutch")
  val english = Language("english")
  val farsi = Language("farsi")
  val french = Language("french")
  val german = Language("german")
  val pinyin = Language("pinyin")
  val portuguese = Language("portuguese")
  val russian = Language("russian")
  val spanish = Language("spanish")
  val swedish = Language("swedish")
 
  implicit def conversion(str: String) = new {
    val language = new WhatLanguage(all).language(str).getOrElse(null)
  }
 
  implicit def languageToString(lang: Language): String = lang.toString
 
  sealed case class Language private[lang] (file: String) {
    lazy val words = {
      val is = new BufferedInputStream(getClass.getResourceAsStream("/lang/" + file))
 
      try {
        BloomSet.load[String](is)
      } finally is.close()
    }
 
    override val toString = file(0).toUpperCase + file.substring(1)
 
    override def equals(other: Any) = other match {
      case lang: Language => toString == lang.toString
      case _ => false
    }
 
    override val hashCode = file.hashCode
  }
 
  class WhatLanguage(langs: Language*) {
 
    def language(str: String) = {
      val back = new ArrayBuffer[Language]
      var max = 0
 
      for ((lang, score) <- processText(str)) {
        if (score > max) {
          back.clear
 
          back += lang
          max = score
        } else if (score == max) {
          back += lang
        }
      }
 
      if (back.length == 1) Some(back(0)) else None
    }
 
    def processText(str: String) = {
      val langs = if (this.langs.contains(all)) all.langs else this.langs
      val prime = langs.foldLeft(Map[Language, Int]()) { _(_) = 0 }
 
      str.split("""\s+""").foldLeft(prime) { (map, token) =>
        langs.foldLeft(map) { (map, lang) =>
          if (lang.words.contains(token.toLowerCase)) {
            map(lang) = map(lang) + 1 
          } else map
        }
      }
    }
  }
}

Conclusion

Well, it’s been a fun and informative journey through the fertile land of bloom sets and simple statistical language analysis.  As usual, all of the source (and binaries) are available for download.  It’s a bit of a shame, but I didn’t get a chance to discuss a number of rather interesting features also available in my bloom filter implementation (such as concatenation of bloom filters).

Bloom filters are quite interesting in and of themselves, and definitely a useful technique that every developer should keep in their back pocket.  Peter Cooper definitely merits kudos for his implementation of WhatLanguage based on bloom filters, but despite the fact that I took the time to port it to Scala, I still don’t see much practical benefit.  At any rate, the library is now available in Scala; hopefully someone else will find it more useful than I do!

Integrating Scala into JRuby

29
Sep
2008

More and more projects (especially startups) have been choosing to build their software in multiple languages.  Rather than using SQL for the database, XML for the RPC and Java for the everything else, companies have learned that sometimes a different language can serve best in a specific area.  Ola Bini provides some guidance with regards to methodology in this area in the form of what he calls “language layers”.  He suggests that an application can be divided logically into separate layers, and for each of these layers there exists a class of language - be it dynamic, static, compiled or otherwise - which can best accomplish the task at hand.  All of that aside, there is one problem which is absolutely assured when dealing with polyglot programming: integration between the languages.

I’m told that this issue came up at the JVM languages conference this past week.  The problem of integrating very separate languages in a natural way is not an easy one, even when all languages in question are on the JVM.  So far, the only solution that anyone has been able to come up with is just to use Java as an intermediary.  After all, JRuby integrates with Java, as does Clojure, therefore JRuby has at least some path of integration with Clojure and vice versa.

The problem is that such integration is not really idiomatic.  As the title of this post implies, we’re going to consider the example of Scala and JRuby.  I’ve already written about how to create a Scala DSL for calling JRuby directly, so let’s look at the other side of the problem: it’s certainly possible to use Scala classes from within JRuby, but it isn’t exactly a pleasant proposition.  Let’s imagine that I want to make use of my Scala port of Rich Hickey’s excellent immutable persistent vector data structure:

import com.codecommit.collection.Vector
 
vec = Vector.new
vec = vec.send('$plus'.to_sym, 1)
vec = vec.update(0, 2)
 
puts vec.apply 0       # 2

Straightforward, but ugly.  Let’s break this down a little bit…  The import and instantiation are both self-explanatory, so we come to the rather cryptic invocation of send.  In case you don’t know, this is a Ruby method which makes it possible to invoke arbitrary methods on a given object, including those with illegal characters.  I happen to know that Scala will compile the + operator to a method named $plus within class Vector.  JRuby is perfectly happy to handle and forward this call as necessary, despite the fact that you could never actually declare this method in pure Ruby ($ has special meaning).  Thus, this invocation is the same as the following Scala snippet:

vec = vec + 1

In other words, append 1 to the vector and assign the resulting new vector back into vec.

Moving on, we come to the invocations of update and apply.  These should be a bit more comprehensible to those familiar with Scala’s intricacies.  In short, these methods are how you overload the parentheses and parentheses/assignment operators.  Thus, our last two lines correspond as follows:

vec = (vec(0) = 2)
println(vec(0))

This was just a trivial example, you can imagine how a more complicated Scala API could be neigh unusable in JRuby.  Intuitively though, it shouldn’t have to be this way.  After all, JRuby supports many of the same syntactic power as Scala: it has some operator overloading, closures and even more complicated features like mixins.  With all of this common ground, surely there is a way to make the two coexist more naturally?  I mean, optimally we could have something like this:

import com.codecommit.collection.Vector
 
vec = Vector.new
vec = vec + 1
vec = (vec[0] = 2)       # problematic
 
puts vec[0]              # 2

What we have just done is informally define a desired syntax for a domain-specific problem.  Does that sound like an internal DSL to anyone else?

Our goal is to create a simple Ruby API that can be required into any JRuby application to improve the integration with Scala.  To do this, we will need to find a way to convert Ruby features into their corresponding Scala features by going through Java.  Once we find this translation, we can use meta-programming and dangerously-cool Monkey Patching to augment JRuby’s existing Java integration.  In this way, we don’t have to start our integration layer from scratch, we can just “pretty up” JRuby’s existing features, taking advantage of the fact that Scala classes are Java classes.

Step One: Operators

From our example above, converting the invocation of a $plus method into a Ruby + operator seems to be the easiest challenge to tackle.  Ruby does support operator overloading; unfortunately, this support is incredibly limited when compared with Scala’s awesome power.  For example, in Scala it is possible to invent arbitrary operators, a feature which is heavily used in the Scala standard library.  Ruby has no such capability, operators are implemented using conventional techniques and hard-coded rules in the parser.

In order to avoid blowing this experiment completely out of proportion, we’re only going to implement translations for the set of conventional Ruby operators.  It would be possible to also implement Scala operators which are made by combining existing Ruby operators (e.g. the ++ operator) using techniques developed for the Superators gem, but to do so would be extremely difficult.

We saw in our original example that the Scala + operator is translated into an invocation of the $plus method.  It stands to reason that we could make use of this fact in our translation from Ruby to Scala.  The trick is finding a comprehensive list of Scala’s operators and what methods they compile to.  Fortunately, I had already investigated these translations for a different project:

Scala Operator Compiles To
= $eq
> $greater
< $less
+ $plus
- $minus
* $times
/ div
! $bang
@ $at
# $hash
% $percent
^ $up
& $amp
~ $tilde
? $qmark
| $bar
\ $bslash

Repeated iterations of an operator become corresponding repetitions of the equivalent character sequence.  Thus, ++ becomes $plus$plus.  This suggests a very natural strategy for converting Ruby operator invocations into Scala: string substitution.  We can easily Ruby operator calls using method_missing, substitute the appropriate character sequences and then attempt the modified call against the same object.  This idea, when translated into Ruby, is almost as simple as that:

OPERATORS = {"=" => "$eq", ">" => "$greater", "<" => "$less", \
        "+" => "$plus", "-" => "$minus", "*" => "$times", "/" => "div", \
        "!" => "$bang", "@" => "$at", "#" => "$hash", "%" => "$percent", \
        "^" => "$up", "&" => "$amp", "~" => "$tilde", "?" => "$qmark", \
        "|" => "$bar", "\\" => "$bslash"}
 
alias_method :__old_method_missing_in_scala_rb__, :method_missing
def method_missing(sym, *args)
  str = sym.to_s
 
  str = $&[1] + '_=' if str =~ /^(.*[^\]=])=$/
 
  OPERATORS.each do |from, to|
    str.gsub!(from, to)
  end
 
  if methods.include? str
    send(str.to_sym, args)
  else
    __old_method_missing_in_scala_rb__(sym, args)
  end
end

Right at the end, after we have transformed the method name to convert any Ruby operators into their Scala equivalents, we actually check to see if the method exists instead of blindly sending the invocation.  The reason for this is to avoid creating an infinite loop in cases where a method really doesn’t exist.  We can rely on the presence of bona fide methods due to the way that JRuby proxies Java objects into Scala.

Astute readers will notice the extra bit of regular expression checking right at the head of the method.  We didn’t cover this in our target example, but this transformation is none-the-less quite important.  Both Scala and Ruby provide mechanisms to simulate assignment to class members through method calls.  In Ruby, you just suffix the method name with =, whereas in Scala the correct suffix is _=.  This extra transformation looks for those situations and converts to the appropriate result.  Thus, Scala var fields are now accessible within JRuby using the same syntax as standard attr_reader/attr_writer methods in Ruby.

Two other operators which might be useful to enable are the [] and []= operators, generally used for collections access.  Scala doesn’t actually support these operators, reserving square brackets for declaring type parameters.  However, it does provide a very similar syntax with parentheses.  As a refresher, here is how we might use an array within Scala:

val arr = new Array(5)
arr(0) = 5
arr(1) = 4
arr(2) = 3
arr(3) = 2
arr(4) = 1
 
println(arr(1))        // 4

At compile-time, this translates into corresponding invocations of the apply and update methods of class Array.  Specifically, apply is used to retrieve data, while update is used to assign it.  These are the direct corollaries to Ruby’s [] and []=.  It would only be natural to translate invocations of these operators into corresponding calls to apply and update, and we can do this simply by extending our method_missing just a little bit:

  # ...
 
  gen_with_args = proc do |name, args|
    code = "#{name}("
 
    unless args.empty?
      for i in 0..(args.size - 2)
        code += "args[#{i}], "
      end
      code += "args[#{args.size - 1}]"
    end
 
    code += ')'
  end
 
  if str == '[]'
    eval(gen_with_args.call('apply', args))
  elsif sym.to_s == '[]='
    eval gen_with_args.call('update', args)            # doesn't work right
  elsif methods.include? str
    send(str.to_sym, args)
  else
    __old_method_missing_in_scala_rb__(sym, args)
  end
end

The gen_with_args proc is merely a nifty little utility closure used to cut down on redundancy without creating an entire method.  It actually generates a String of Ruby code which invokes the specified method with the given arguments.  This is required because JRuby’s Java integration is only so smart.  It will try to properly convert invocations of n-arity Java methods when Ruby arrays are passed as arguments, but it can only do so well.  The safest route is to just call the method, passing each element of the array as a successive argument.

All of this looks quite nice, and it seems to satisfy our requirements, but there is just one problem: Ruby doesn’t behave itself with respect to the []= operator.  Rather than taking the sensible approach and allowing the receiving class to define its return value, it actually ignores whatever value is returned from the []= method and forces it to be the final argument.  In other words, the above code will work, but it might not integrate with Scala in quite the way we would expect.  Consider our original motivating example:

import com.codecommit.collection.Vector
 
vec = Vector.new
vec = vec + 1
vec = (vec[0] = 2)       # problematic
 
puts vec[0]              # 2

Well, I said that this would be problematic.  The issue is the value of the expression vec[0] = 2 is not a new Vector instance as returned by the Vector#update method, but actually the Fixnum value 2.  Thus, the snippet above can never work.  In what is probably the most bone-headed feature of the entire language, Ruby forces every invocation of []= to return the assigned value.  This works fine (sort-of) for mutable, side-effecting implementations like Array and Scala’s ListBuffer, but it completely falls apart for immutable, functional collections like Vector.  So much for a language that “makes me smile”…

The solution of course is to always be aware of these cases where the update method does not return Unit and just use update directly rather than []=.  Thus, a working version of the given snippet would be as follows:

import com.codecommit.collection.Vector
 
vec = Vector.