
Implementing Persistent Vectors in Scala

26 Aug 2008

Oh yeah, we’re really getting into Digg-friendly titles now, aren’t we?

The topic of persistent vectors is one of those odd backwaters of functional programming that few dare to approach.  The basic idea behind it all is to try to devise an immutable data structure which has the performance characteristics of a mutable vector.  What this means for practical application is that you shouldn’t have to deal with O(n) efficiency on random access like you do with List(s).  Instead, accessing arbitrary indexes should be constant time (O(1)), as should computing its length.  Additionally, modifying an arbitrary index should be reasonably efficient – as close to constant-time as possible.

Of course, the word “modifying” is a trifle inaccurate as it implies a mutable data structure.  One of the primary qualifications for a purely functional vector is that it is completely immutable.  Any changes to the vector result in the creation of a new vector, rather than modifying the old.  Basically, it’s an immutable data structure just like any other, but one which retains the brutal efficiency of its mutable counterpart.

Unfortunately, this turns out to be a very tough nut to crack.  A number of researchers have attempted different strategies for solving the problem, none of which have been entirely successful.  Rich Hickey, the creator of Clojure, has a brilliant presentation that essentially describes the solution I have chosen.  For the impatient, the good stuff starts at about 36:00 and lasts for about ten minutes.  I’ll elaborate on the problems with functional vectors a bit more in a second, but first a bit of motivational propaganda…

Thou Shalt Not Mutate

There is a single principle which should be drilled into the skulls of all programmers everywhere: mutable data structures are bad news.  Don’t get me wrong, I love ArrayList as much as the next Java addict, but such structures cause serious problems, particularly where concurrency is concerned.  We can consider the trivial example where two threads are attempting to populate an array simultaneously:

private static String[] names = new String[6];
private static int index = 0;
 
public static void main(String[] args) throws InterruptedException {
    Thread thread1 = new Thread() {
        public void run() {
            names[index++] = "Daniel";
            names[index++] = "Chris";
            names[index++] = "Joseph";
        }
    };
 
    Thread thread2 = new Thread() {
        public void run() {
            names[index++] = "Renee";
            names[index++] = "Bethany";
            names[index++] = "Grace";
        }
    };
 
    thread1.start();
    thread2.start();
 
    thread1.join();
    thread2.join();
 
    for (String name : names) {
        System.out.println(name);
    }
}

What does this snippet print?  I don’t know.  It’s actually indeterminate.  Now we can guess that on most machines the result will be essentially interleaved between the two threads, but there is no way to guarantee this.  Part of the reason for this is the fact that arrays are mutable.  As such, they enable (and indeed, encourage) certain patterns which are highly destructive when employed asynchronously.

However, concurrency is not even the only motivation for immutable data structures.  Consider the following example:

List<String> names = ...
for (String name : names) {
    names.add(name.toUpperCase());
}

I’m sure all of us have done something like this, most likely by accident.  The result is (of course) a ConcurrentModificationException caused by the fact that we are attempting to add to a List while we are iterating over its contents.  I know that the first time I was faced with this error message I became extremely confused.  After all, no threads are being employed, so why is this a problem?

Iterators are extremely dependent on the internal state of their data structure.  Anyone who has ever implemented an iterator for a linked list or (even better) a tree will attest to this fact.  This means that generally speaking, there is no way for an iterator to guarantee correctness if that structure is changing out from underneath it (so to speak).  Things may be fine in a linear structure like a list, but as soon as you get into anything non-linear like a tree or a hash table it becomes difficult to even define what the “correct” behavior should be.  Think about it; should the iterator backtrack to hit the missing elements?  Should this backtracking include elements which have already been consumed?  What if the order changes dramatically and pre-consumed elements are now ahead of the current index?  There are a whole host of problems associated with iterating over mutable data structures, and so rather than vainly attempt to solve these issues in a sane and consistent manner, JDK collections simply throw an exception.

All of this becomes moot once you start using immutable data structures.  There is no way to modify a structure while iterating over it because there is no way to modify the structure at all!  Concurrency is not an issue because without any mutable state to require locking, every thread can operate simultaneously on the structure.  Not only is it thread safe, but it is unsynchronized and thread safe.  Immutable data structures retain all of the asynchronous throughput of non-locking implementations without any of the race conditions and non-determinacy which usually results.

A Brief Note on Idioms

At this point, the question you must be asking yourself is: “So if the data structure is immutable, what good is it?”  The answer is “for reading”.  Data structures spend most of their lives being read and traversed by other code.  Immutable data structures can be read in exactly the same fashion as mutable ones.  The trick of course is constructing that data structure in the first place.  After all, if the data structure is completely immutable, where does it come from?  A simple example from a prior article is sufficient to demonstrate both aspects:

def toSet[T](list: List[T]): Set[T] = list match {
  case hd :: tail => toSet(tail) + hd
  case Nil => Set[T]()
}

This is neither the most concise nor the most efficient way to accomplish this task.  The only purpose served by this example is to illustrate that it is very possible to build up immutable data structures without undue syntactic overhead.  You’ll notice that every time we want to “change” a data structure – either removing from the list or adding to the set – we use a function call and either pass or return the modified structure.  In essence, the state is kept entirely on the stack, with each new version of the data structure in question becoming the “changed” version from the previous operation.

This idiom is actually quite powerful and can be applied to even more esoteric (and less uniformly iterative) operations.  As long as you are willing to let execution jump from one function to the next, it becomes extremely natural to deal with such immutability.  In fact, you start to think of immutable data structures as if they were in fact mutable, simply due to the fact that you are idiomatically “modifying” them at each step.  Note that this pattern of modifying data between functions is critical to actor-based programming and any seriously concurrent application design.

Problems Abound

For the sake of argument, let’s assume that my pitiful arguments have convinced you to lay aside your heathen mutating ways and follow the path of functional enlightenment.  Let’s also assume that you’re consumed with the desire to create an application which tracks the status of an arbitrary number of buttons.  These buttons may be pressed in any order regardless of what other buttons are already pressed.  Following the path of immutability and making use of the patron saint of persistent data structures (List), you might come up with the following solution:

class ButtonStrip private (buttons: List[Boolean]) {
  def this(num: Int) = {
    this((0 until num).foldLeft(List[Boolean]()) { (list, i) =>
      false :: list
    })
  }
 
  def status(index: Int) = buttons(index)
 
  def push(index: Int) = modify(index, true)
 
  def unpush(index: Int) = modify(index, false)
 
  /**
   * Modify buttons list and return a new ButtonStrip with the results.
   */
  private def modify(index: Int, status: Boolean) = {
    val (_, back) = (buttons :\ (buttons.length - 1, List[Boolean]())) { (button, tuple) =>
      val (i, total) = tuple
      (i - 1, (if (i == index) status else button) :: total)
    }
    new ButtonStrip(back)
  }
}

This is a horrible mess of barely-idiomatic functional code.  It’s difficult to read and nearly impossible to maintain; but it’s purely immutable!  This is not how you want your code to look.  In fact, this is an excellent example of what David MacIver would call “bad functional code”.

Perhaps even worse than the readability (or lack thereof) is the inefficiency of this code.  It’s terribly slow for just about any sort of operation.  Granted, we can imagine this only being used with a list of buttons of limited length, but it’s the principle of the thing.  The fact is that we are relying on a number of operations which are extremely inefficient with lists, most prominently length and apply() (accessing an arbitrary index).  Not only that, but we’re recreating the entire list every time we change the status of a single button, something which is bad for any number of reasons.

What we really need here is a random-access structure, something which allows us to access and “change” any index with some degree of efficiency.  Likely the most intuitive thing to do here would be to just use a good ol’ array of Boolean(s) and make a new copy of this array any time we need to change something.  Unfortunately, this is almost as bad as our copying the list every time.  Normally, when you use an immutable data structure, modifications do not require copying large amounts of data.  Our toSet example from above uses almost zero data copying under the surface, due to the way that Set and List are implemented.

Specifically, Set and List are persistent data structures.  This doesn’t mean that they live on a filesystem.  Rather, the term “persistent” refers to the fact that each instance of the collection may share significant structure with another instance.  For example, prepending an element onto an immutable list yields a new list which consists of the new element and a tail which is precisely the original list.  Thus, each list contains its predecessor (if you will) within itself.  List is an example of a fully persistent data structure; not everything can be so efficient.  Set and Map for example are usually implemented as some sort of tree structure, and so insertions require some data copying (specifically, the parent nodes).  However, this copying is minimized by the nature of the structure.  This notion of persistence in the data structure works precisely because these structures are immutable.  If you could change an element in a persistent data structure it would likely result in the modification of that same element in totally disparate instances of that structure across the entire runtime (not a pleasant outcome).
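
To make the sharing concrete, here is a tiny sketch using the standard List: the prepended list literally holds a reference to the original as its tail, so no copying takes place.

val original = List(2, 3, 4)
val extended = 1 :: original     // a new list whose tail *is* the original

extended.tail eq original        // true -- reference identity, the structure is shared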

So List is persistent, arrays are not.  Even if we treat arrays as being completely immutable, the overhead of copying a potentially huge array on each write operation is rather daunting.  What we need is some sort of data structure with the properties of an array (random access, arbitrary modification) with the persistent properties of a list.  As I said before, this turns out to be a very difficult problem.

Partitioned Tries

One solution to this problem which provides a compromise between these two worlds is that of a partitioned trie (pronounced “try” by all that is sane and holy).  In essence, a partitioned trie is a tree of vectors with a very high branching factor (the number of children per node).  Each vector is itself a tree of vectors and so on.  Note that these are not like the binary search trees that everyone had to create back in college; partitioned tries can potentially have dozens of branches per node.  As it turns out, it is this unusual property which makes these structures so efficient.

To get a handle on this concept, let’s imagine that we have a trie with a branching factor of 3 (much smaller than it should be, but it’s easier to draw).  Into this vector we will insert the following data:

Index   Data
0       Daniel
7       Chris
2       Joseph
4       Renee
1       Bethany
3       Grace
9       Karen
13      Larry
5       Moya

After all the jumbling necessary to make this structure work, the result will look like the following:

[Diagram: a partitioned trie with a branching factor of 3 containing the sample data]

It’s hard to see where the “trie” part of this comes into play, so bear with me.  The important thing to notice here is the access times for indexes 0-2: it’s O(1).  This is of course a tree, so not all nodes will be one step away from the root, but the point is that we have achieved constant-time access for at least some of the nodes.  Mathematically speaking, the worst-case access time for any index n is O( log3(n) ).  Not too horrible, but we can do better.

First though, we have to lay some groundwork.  I said that the structures we were working with are tries rather than normal trees.  A trie implies that the key for each element is encoded in the path from the root to that node.  So far, it just appears that we have built a fancy tree with numeric keys and a higher branching factor than “normal”.  This would be true if all we were given is the above diagram, but consider the slightly modified version below:

[Diagram: the same trie with each node’s three outgoing edges renumbered 0 through 2]

Structurally, nothing has changed, but most of the edges have been renumbered.  It is now a bit more apparent that each node has three branches numbered from 0 to 2.  Also, with a little more thought, we can put the rest of the pieces together.

Consider the “Moya” node.  In our input table, this bit of data has an index of 5.  To find its “index” in our renumbered trie, we follow the edges from the root down to the destination node, reading off the digits 1 and 2.  However, remember that each node has only 3 branches.  Intuitively, we should be thinking about base-3 math somewhere about now.  And indeed, reading 12 as a base-3 number yields a value of 5, indicating that the index of the node is indeed encoded, digit by digit, in the path from the root.  By the way, this works on any node (try it yourself).  The path to “Karen” is 100, which read in base 3 is 9, the input index of the element.
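
If you want to check the arithmetic yourself, the digits along a path can be interpreted with an ordinary radix conversion (a quick sketch, not part of the vector implementation):

Integer.parseInt("12", 3)    // == 5, the path to "Moya" read as a base-3 number
Integer.parseInt("100", 3)   // == 9, the path to "Karen"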

This is all fine and dandy, but we haven’t really solved our original problem yet: how to achieve constant-time access to arbitrary indexes in a persistent structure.  To really approach a solution to our problem, we must increase the branching factor substantially.  Rather than working with a branching factor of 3 (and thus, O( log3(n) ) efficiency), let’s dial the branching factor up to 32 and see what happens.

The result is completely undiagramable; but it does actually provide constant time access for indexes between 0 and 31.  If we were to take our example data set and input it into our revised trie, the result would be a single layer of nodes, numbered at exactly their logical index value.  In the worst case, the efficiency of our more complicated trie is O( log32(n) ).  Generally speaking, we can infer (correctly) that for any branching factor b and any index n, the lookup efficiency will be precisely logb(n).  As we increase the branching factor, the read-efficiency of our trie improves dramatically.  To put a branching factor of 32 into perspective, accessing index 1,000,000 requires descending only log32(1,000,000) ≈ 3.986 levels!  That’s incredibly small, especially given the sheer magnitude of the index in question.  It’s not technically constant time, but it’s so incredibly small for all conceivable indexes that we can just pretend that it is.  As Rich Hickey says:

…when it’s a small enough value, I really don’t care about it.
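
The depth figures in question come straight out of a logarithm; as a quick back-of-the-envelope sketch:

Math.log(1000000) / Math.log(3)    // ~12.6 levels with a branching factor of 3
Math.log(1000000) / Math.log(32)   // ~3.99 levels with a branching factor of 32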

So that takes care of the reading end of life, but what about writing?  After all, if all we needed was constant time lookups, we could just use an array.  What we really need to take care to do is ensure that modifications are also as fast as we can make them, and that’s where the tricky part comes in.

We can think of an array as a partitioned trie with a branching factor of infinity.  When we modify an array immutably, we must copy every element from the original array into a new one which will contain the modification.  This contrasts with List – effectively, a partitioned trie with a branching factor of 1 – which in the best case (prepending) requires none of the elements to be copied.  Our 32-trie is obviously somewhere in between.  As I said previously, the partitioned trie doesn’t really solve the problem of copying, it just compromises on efficiency somewhat (the difference between 1 and 3.986).

The truth is that to modify a partitioned trie, every node in the target sub-trie must be copied into a new subtrie, which then forces the copying of its level and so-on recursively until the root is reached.  Note that the contents of the nodes are not being copied, just the nodes themselves (a shallow copy).  Thus, if we return to our example 3-trie from above and attempt to insert a value at index 12, we will have to copy the “Larry” node along with our new node to form the children of a copy of the “Renee” node.  Once this is done, the “Grace” and “Moya” nodes must also be copied along with the new “Renee” to form the children of a new “Bethany”.  Finally, the “Daniel” and “Joseph” nodes are copied along with the new “Bethany” to form the children of a new root, which is returned as the modified trie.  That sounds like a lot of copying, but consider how much went untouched.  We never copied the “Karen” or the “Chris” nodes, they just came over with their parent’s copies.  Instead of copying 100% of the nodes (as we would have had it been an array), we have only copied 80%.  Considering that this was an example contrived to force the maximum copying possible, that’s pretty good!

Actually, we can do even better than this by storing the children of each node within an array (we would have to do this anyway for constant-time access).  Thus, only the array and the modified nodes need be copied, the other nodes can remain untouched.  Using this strategy, we further reduce the copying from 80% to 30%.  Suddenly, the advantages of this approach are becoming apparent.
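
As a rough illustration of that strategy, here is a minimal sketch (my own naming, not the actual Vector code) of a persistent update which copies only the nodes and child arrays along a single path and shares everything else by reference:

class Node[T](val data: Option[T], val children: Array[Node[T]]) {
  def updated(path: List[Int], value: T): Node[T] = path match {
    case Nil => new Node(Some(value), children)          // target reached: replace the data
    case hd :: tail => {
      val newChildren = children.clone()                 // shallow copy of one small array
      val child = if (children(hd) != null) children(hd)
                  else new Node[T](None, new Array[Node[T]](children.length))
      newChildren(hd) = child.updated(tail, value)
      new Node(data, newChildren)                        // untouched siblings are shared
    }
  }
}

Each call rebuilds exactly one node (and its child array) per level, which is where the “copy the path back to the root” behavior described above comes from.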

Now of course, the higher the branching factor, the larger the arrays in question and so the less efficient the inserts.  However, insertion is always going to be more efficient than straight-up arrays so long as the inserted index is greater than the branching factor.  Considering that most vectors have more than 32 elements, I think that’s a pretty safe bet.

Implementation

I bet you thought I was going to get to this section first.  Foolish reader…

Once we have all this theoretical ground-work, the implementation just falls into place.  We start out with a Vector class parameterized covariantly on its element type.  Covariant type parameterization just means that a vector with type Vector[B] is a subtype of Vector[A] whenever B is a subtype of A.  List works this way as well, as do most immutable collections, but as it turns out, this sort of parameterization is unsafe (meaning it could lead to a breakdown in the type system) when used with mutable collections.  This is part of why generics in Java are strictly invariant.

Coming back down to Earth (sort of), we consider for our design that the Vector class will represent the partitioned trie.  Since each child node in the trie is a trie unto itself, it is only logical to have each of the nodes also represented by Vector.  Thus, a Vector must have three things:

  • data
  • length
  • branches

Put into code, this looks like the following:

class Vector[+T] private (private val data: Option[T], 
      val length: Int, private val branches: Seq[Vector[T]]) extends RandomAccessSeq[T] {
  def this() = this(None, 0, new Array[Vector[T]](BRANCHING_FACTOR))
 
  // ...
}

RandomAccessSeq is a parent class in the Scala collections API which allows our vector to be treated just like any other collection in the library.  You’ll notice that we’re hiding the default constructor and providing a no-args public constructor which instantiates the default.  This only makes sense as all of those fields are implementation-specific and should not be exposed in the public API.  It’s also worth noting that the branches field is typed as Seq[Vector[T]] rather than Array[Vector[T]].  This is a bit of a type-system hack to get around the fact that Array is parameterized invariantly (as a mutable sequence) whereas Seq is not.

With this initial design decision, the rest of the implementation follows quite naturally.  The only trick is the ability to convert an index given in base-10 to the relevant base-32 (where 32 is the branching factor) values to be handled at each level.  After far more pen-and-paper experimentation than I would like to admit, I finally arrived at the following solution to this problem:

def computePath(total: Int, base: Int): List[Int] = {
  if (total < 0) {
    throw new IndexOutOfBoundsException(total.toString)
  } else if (total < base) List(total) else {
    var num = total
    var digits = 0
    while (num >= base) {
      num /= base
      digits += 1
    }
 
    val rem = total % (Math.pow(base, digits)).toInt
 
    val subPath = computePath(rem, base)
    num :: (0 until (digits - subPath.length)).foldRight(subPath) { (i, path) => 0 :: path }
  }
}

As a brief explanation, if our branching factor is 10 and the input index (total) is 20017, the result of this recursive function will be List(2, 0, 0, 1, 7).  The final step in the method (dealing with ranges and folding and such) is required to solve the problem of leading zeroes dropping off of subsequent path values and thus corrupting the final coordinates in the trie.
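
A couple of quick spot checks of that method (assuming the definition above):

computePath(20017, 10)   // List(2, 0, 0, 1, 7)
computePath(17, 10)      // List(1, 7)
computePath(5, 32)       // List(5) -- indexes below the branching factor need only one step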

The final step in our implementation (assuming that we’ve got the rest correct) is to implement some of the utility methods common to all collections.  Just for demonstration, this is the implementation of the map function.  It also happens to be a nice, convenient example of good functional code.  :-)

override def map[A](f: (T)=>A): Vector[A] = {
  val newBranches = branches map { vec =>
    if (vec == null) null else vec map f
  }
 
  new Vector(data map f, length, newBranches)
}

Properties

Before moving on from this section, it’s worth noting that our implementation of the vector concept has some rather bizarre properties not held by conventional, mutable vectors.  For one thing, it has a logically infinite size.  What I mean by this is that it is possible to address any positive integral index within the vector without being concerned about resize penalties.  In fact, the only addresses which throw an IndexOutOfBoundsException are negative.  The length of the vector is defined to be the maximum index which contains a valid element.  This actually mirrors the semantics of Ruby’s Array class, which also allows placement at any positive index.  Interestingly enough, the worst-case efficiency of addressing arbitrary indexes is actually much better in the persistent trie than it is in a conventional amortized array-based vector.

Vectored Buttons

Since we now have an immutable data structure with efficient random access, we may as well rewrite our previous example of the button strip using this structure.  Not only is the result far more efficient, but it is also immensely cleaner and easier to read:

class ButtonStrip private (buttons: Vector[Boolean]) {
  def this(num: Int) = this(new Vector[Boolean])     // no worries about length
 
  def status(index: Int) = buttons(index)
 
  def push(index: Int) = new ButtonStrip(buttons(index) = true)
 
  def unpush(index: Int) = new ButtonStrip(buttons(index) = false)
}

You’ll notice that the update method is in fact defined for Vector, but rather than returning Unit it returns the modified Vector.  Interestingly enough, we don’t need to worry about length allocation or anything bizarre like that due to the properties of the persistent vector (infinite length).  Just like arrays, a Vector is pre-populated with the default values for its type.  In the case of most types, this is null.  However, for Int, the default value is 0; for Boolean, the default is false.  We exploit this property when we simply return the value of dereferencing the vector based on any index.  Thus, our ButtonStrip class actually manages a strip of infinite length.
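
For reference, ordinary Scala arrays exhibit exactly the default-value behavior the vector is mirroring here (a quick sketch):

val flags = new Array[Boolean](3)
val nums = new Array[Int](3)

flags(0)   // false -- the default value for Boolean
nums(0)    // 0 -- the default value for Int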

Conclusion

The truth is that we didn’t even go as far as we could have in terms of optimization.  Clojure has an implementation of a bit-partitioned hash trie which is basically the same thing (conceptually) as the persistent vector implemented in this article.  However, there are some important differences. 

Rather than jumping through strange mathematical hoops and hacky iteration to develop a “path” to the final node placement, Clojure’s partitioned trie uses bit masking to simply choose the trailing five bits of the index.  If this node is taken, then the index is right-shifted and the next five bits are masked off.  This is far more efficient in path calculation, but it has a number of interesting consequences on the final shape of the trie.  Firstly, the average depth of the tree for random input is less, usually by around one level.  This means that the array copying on insert must occur less frequently, but must copy more references at each step.  Literally, the trie is more dense.  This probably leads to superior performance.  Unfortunately, it also requires that the index for each value be stored along with the node, requiring more memory.  Also, this sort of “bucket trie” (to coin a phrase) is a little less efficient in lookups in the average case.  Not significantly so, but the difference is there.  Finally, this masking technique requires that the branching factor be a power of 2.  This isn’t such a bad thing, but it does impose a minor restriction on the flexibility of the trie.

Most importantly though, Clojure’s trie uses two arrays to contain the children: one for the head and one for the tail.  This is a phenomenally clever optimization which reduces the amount of array copying by 50% across the board.  Rather than having a single array of length 32, it has two arrays of length 16 maintained logically “end to end”.  The correct array is chosen based on the index and recursively from there on down the line.  Naturally, this is substantially more efficient in terms of runtime and heap cost on write.

At the end of the day though, these differences are all fairly academic.  Just having an idiomatic partitioned trie for Scala is a fairly significant step toward functional programming relevance in the modern industry.  With this structure it is possible to still maintain lightning-fast lookup times and decent insertion times without sacrificing the critical benefits of immutable data structures.  Hopefully structures like this one (hopefully, even better implementations) will eventually find their way into the standard Scala library so that all may benefit from their stream-lined efficiency.

Update

The implementation of Vector which I present in this article is inherently much less efficient than the one Rich Hickey created for Clojure.  I finally broke down and created a line-for-line port from Clojure’s Java implementation of PersistentVector into a Scala class.  I strongly suggest that you use this (much faster) implementation, rather than my own flawed efforts.  :-)   You can download the improved Vector here: Vector.scala.

How Do You Apply Polyglotism?

18 Aug 2008

For the past two years or so, there has been an increasing meme across the developer blogosphere encouraging the application of the polyglot methodology.  For those of you who have been living under a rock, the idea behind polyglot programming is that each section of a given project should use whatever language happens to be most applicable to the problem in question.  This makes for a great topic for arm-chair bloggers, leading to endless pontification and flame-wars on forum after forum, but it seems to be a bit more difficult to apply in the real world.

The fact is that very few companies are open to the idea of diversity in language selection.  Just look at Google, one of the most open-minded and developer-friendly companies around.  They employ some of the smartest people I know, programmers who have actually invented languages with wide-scale adoption.  However, this same company mandates the use of a very small set of languages including Python, Java, C++ and JavaScript.  If a company like Google can’t even bring itself to dabble in language diversity, what hope do we have for the Apples of the world?

A few months ago, I received an internal email from the startup company where I work.  This email was putting forth a new policy which would restrict all future developments to one of two languages: PHP or Java.  In fact, this policy went on to push for the eventual rewrite of all legacy projects which had been written in other languages including Objective-C, Ruby, Python and a fair number of shell scripts.  I was utterly flabbergasted (to say the least).  A few swift emails later, we were able to come to a more moderate position, but the prevailing attitude remains extremely focused on minimizing the choice of languages.

To my knowledge, this sort of policy is fairly common in the industry.  Companies (particularly those employing consultants) seem to prefer to keep the technologies employed to a minimum, focusing on the least-common denominator so as to reduce the requirements for incoming developer skill sets.  This is rather distressing to me, because I get a great deal of pleasure out of solving problems differently using alternative languages.  For example, I would have loved to build the clustering system at my company using the highly-scalable actor model with Scala, but the idea was shot down right out of the gate because it involved a non-mainstream language.  To be fair to my colleagues, the overall design involved was given more serious consideration, but it was always within the confines of Java, rather than the original actor-driven concept.

There is actually another aspect to this question: assuming you are allowed to use a variety of languages to "get the job done", how do you apply them?  Ola Bini has talked about the various layers of a system, but this is harder to see in practice than it would seem.  How do you define where to "draw the line" between using Java and Scala, or even the more dramatic differences between Java and JRuby or Groovy?  Of course, we can base our decision strictly on lines of code, but in that case, Scala would trump Java every time.  For that matter, Ruby would probably beat out the two of them, and I’m certainly not writing my next large-scale enterprise app exclusively in a dynamic language.

I realize this is somewhat of a cop-out post, just asking a question and never arriving at a satisfactory conclusion, but I would really like to know how other developers approach this issue.  What criteria do you weigh in making the decision to go with a particular language?  What sorts of languages work well for which tasks?  And above all, how do you convince your boss that this is the right way to go?  The floor is open, please enlighten me!  :-)

Case Classes Are Cool

11 Aug 2008

Of all of Scala’s many features, this one has probably taken the most flack over the past year or so.  Not immutable data structures or even structural types, but rather a minor variation on a standard object-oriented construct.  This is more than a little surprising, especially considering how much work they can save when properly employed.

Quick Primer

Before we get into why they’re so nice, we should probably look at what they are and how to use them.  Syntactically, case classes are standard classes with a special modifier: case.  This modifier signals the compiler to assume certain things about the class and to define certain boiler-plate based on those assumptions.  Specifically:

  • Constructor parameters become public “fields” (Scala-style, which means that they really just have an associated accessor/mutator method pair)
  • Methods toString(), equals() and hashCode() are defined based on the constructor fields
  • A companion object containing:
    • An apply() constructor based on the class constructor
    • An extractor based on constructor fields

What this means is that we can write code like the following:

case class Person(firstName: String, lastName: String)
 
val me = Person("Daniel", "Spiewak")
val first = me.firstName
val last = me.lastName
 
if (me == Person(first, last)) {
  println("Found myself!")
  println(me)
}

The output of the above is as follows:

Found myself!
Person(Daniel,Spiewak)

Notice that we’re glossing over the issue of pattern matching and extractors for the moment.  To the regular-Joe object-oriented developer, the really interesting bits are the equals() method and the automatic conversion of the constructor parameters into fields.  Considering how many times I have built “Java Bean” classes solely for the purpose of wrapping data up in a nice neat package, it is easy to see where this sort of syntax sugar could be useful.

However, the above does deserve some qualification: the compiler hasn’t actually generated both the accessors and the mutators for the constructor fields, only the accessors.  This comes back to Scala’s convention of “immutability first”.  As we all know, Scala is more than capable of expressing standard imperative idioms with all of their mutable gore, but it tries to encourage the use of a more functional style.  In a sense, case classes are really more of a counterpart to type constructors in languages like ML or Haskell than they are to Java Beans.  Nevertheless, it is still possible to make use of the syntax sugar provided by case classes without giving up mutability:

case class Person(var firstName: String, var lastName: String)
 
val me = Person("Daniel", "Spiewak")
me.firstName = "Christopher"   // call to a mutator

By prefixing each constructor field with the var keyword, we are effectively instructing the compiler to generate a mutator as well as an accessor method.  It does require a bit more syntactic bulk than the immutable default, but it also provides more flexibility.  Note that we may also use this var-prefixed parameter syntax on standard classes to define constructor fields, but the compiler will only auto-generate an equals() (as well as hashCode() and toString()) method on a case class.
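
For instance, a plain (non-case) class can use the same var-prefixed parameters to get generated accessors and mutators, but equality remains reference-based.  A small sketch with a hypothetical MutablePerson class:

class MutablePerson(var firstName: String, var lastName: String)

val p1 = new MutablePerson("Daniel", "Spiewak")
val p2 = new MutablePerson("Daniel", "Spiewak")

p1 == p2                       // false -- no generated equals(), so this is reference equality
p1.firstName = "Christopher"   // the generated mutator still works, just as in the case class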

Why Are They Useful?

All of this sounds quite nice, so why are case classes so overly-maligned?  Cedric Beust, the creator of the TestNG framework, even went so far as to call case classes “…a failed experiment”.

From my understanding of Scala’s history, case classes were added in an attempt to support pattern matching, but after thinking about the consequences of the points I just gave, it’s hard for me to see case classes as anything but a failure. Not only do they fail to capture the powerful pattern matching mechanisms that Prolog and Haskell have made popular, but they are actually a step backward from an OO standpoint, something that I know Martin [Odersky] feels very strongly about and that is a full part of Scala’s mission statement.

Well, he’s right…at least as far as the pattern matching bit is involved.  Case classes are almost essential for useful pattern matching.  I say “almost” because it is possible to have pattern matching in Scala without ever using a single case class, thanks to the powerful extractors mechanism.  Case classes just provide some nice, auto-generated magic to speed things along, as well as allowing the compiler to do a bit more checking than would be otherwise possible.
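
As a rough sketch of that extractor mechanism (hand-rolled, no case class in sight), an ordinary class plus a companion object with apply and unapply recovers most of the same pattern matching that the case keyword gives you for free:

class Person(val firstName: String, val lastName: String)

object Person {
  def apply(first: String, last: String) = new Person(first, last)
  def unapply(p: Person): Option[(String, String)] = Some((p.firstName, p.lastName))
}

Person("Daniel", "Spiewak") match {
  case Person(first, _) => println("Hello, " + first)
}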

The point that I think Cedric (and others) have missed entirely is that case classes are far more than just a means to get at pattern matching.  Even the most stringent object-oriented developer has to admit that a slick syntax for declaring a data container (like a bean) would be a nice thing to have.  What’s more, Scala’s automatic generation of a companion object for every case class lends itself very nicely to some convenient abstractions.  Consider a scenario I ran into a few months back:

class MainWindow(parent: Shell) extends Composite(parent, SWT.NONE) {
  private lazy val display = parent.getDisplay
 
  private val panels = Map("Foreground" -> ForegroundPanel, 
                           "Background" -> BackgroundPanel, 
                           "Font" -> FontPanel)
 
  setLayout(new FillLayout())
 
  val folder = new TabFolder(this, SWT.BORDER)
  for ((text, make) <- panels) {
    val item = new TabItem(folder, SWT.NONE)
    val panel = make(folder)
 
    item.setText(text)
    item.setControl(panel)
  }
 
  def this() = this(new Shell(new Display()))
 
  def open() {
    parent.open()
    layout()
 
    while (!parent.isDisposed) {
      if (!display.readAndDispatch()) {
        display.sleep()
      }
    }
  }
}
 
case class ForegroundPanel(parent: Composite) extends Composite(parent, SWT.NONE) {
  ...
}
 
case class BackgroundPanel(parent: Composite) extends Composite(parent, SWT.NONE) {
  ...
}
 
case class FontPanel(parent: Composite) extends Composite(parent, SWT.NONE) {
  ...
}

If you ignore the SWT boiler-plate, the really interesting bits here are the Map of panels and the initialization loop for the TabItem(s).  In essence, I am making use of a cute little trick with the companion objects of each of the panel case classes.  These objects are automatically generated by the compiler, extending the function type (Composite)=>ForegroundPanel, where ForegroundPanel is replaced by the case class in question.  Because each of these classes extends Composite, the inferred type of panels will be: Map[String, (Composite)=>Composite] (actually, I’m cheating a bit and not giving the precise inference, only its effective equivalent).

This definition allows the iteration over the elements of panels, generating a new instance by using the value element as a function taking a Composite and returning a new Composite instance: the desired child panel.  It’s all statically typed without giving up either the convenience of a natural configuration syntax (in the panels declaration) or the familiarity of a class definition for each panel.  This sort of thing would certainly be possible without case classes, but more work would be required on my part to properly declare each companion object by hand.

Conclusion

I think the reason that a lot of staid object-oriented developers tend to frown on case classes is their close connection to pattern matching, a more powerful relative of the much-despised switch/case mechanism.  What these developers fail to realize is that case classes are really much more than that, freeing us from the boiler-plate tyranny of endless getter/setter declarations and the manual labor of proper equals() and toString() methods.  Case classes are the object-oriented developer’s best friend, just no one seems to realize it yet.

Scala Collections for the Easily Bored Part 3: All at Once

4 Aug 2008

In the previous installment of this series, we looked at how Scala’s collections provide common mechanisms for iteration, as well as many higher-order operations in the same conceptual vein.  In this, the third and final part of the series, we will examine some mechanisms for conceptually operating on the collection as a whole.  Thus, rather than transforming individual collection elements, we will be looking at ways to inspect and modify the data structure itself.

Filtration

Stepping out of Scala for a moment, let’s consider the common design paradigm of the relational database.  Imagine that we have defined a table, people, which contains hundreds of thousands of records, assembled by some despotic government.  Now perhaps the secret service in this government wants to dispatch legal enforcers to the residences of all adults with the “wrong” political leaning.  The most natural way to accomplish this task would be to perform an SQL SELECT, filtering by age and politics:

SELECT * FROM people WHERE age > 18 AND affiliation = 'Green Party'

This query would of course return a data set which could be iterated over, performing the necessary operations (in this case, incarceration) for each record.  As it turns out, this sort of use-case is not confined solely to databases.  It is often necessary to restrict or filter a data structure based on certain criteria.  As a trivial example, imagine that we have been passed a List of Int(s) and we want to remove all odd numbers.  We could accomplish this by using the filter method and passing a function value to describe the criterion:

def onlyEven(nums: List[Int]) = {
  nums.filter((n) => n % 2 == 0)
}

If we call the onlyEven method, passing List(1, 2, 3, 4, 5), the result will be List(2, 4), since 2 and 4 are the only numbers into which 2 divides evenly.  As with many of the other common collection operations, Scala includes a syntax sugar for filter.  We can rewrite the previous example using for-comprehensions in the following way:

def onlyEven(nums: List[Int]) = {
  for {
    n <- nums
    if n % 2 == 0
  } yield n
}

This is a little different from the for-comprehensions we have seen already in that we have placed the yield statement outside the braces with the comprehension definition within.  Believe it or not, this syntax is perfectly valid, coming from Haskell’s do-notation.  This construct will be translated at compile time into a corresponding invocation of filter (and map, but that isn’t terribly relevant here) prior to type checking.  This sort of notation can be very convenient for many tasks, similar to LINQ in the .NET languages.

Partition

In the previous example, we looked at how to selectively remove elements from a collection based on a single criterion.  However, sometimes the requirement is not to remove elements, but rather to separate them into different collections.  For example, we might not want to simply throw away all odd elements in a list, it might be useful to keep both even and odd lists on hand for further operation.  This can be accomplished using the partition method:

def separateEven(nums: List[Int]) = {
  nums.partition((n) => n % 2 == 0)
}

You will notice that the signature for partition looks remarkably similar to filter.  This uniformity in the API is by design, making usage both easier to remember and reducing the number of changes necessary to switch between similar operations.  However, despite the similarity in dispatch, partition returns a very different result:

val numbers = List(1, 2, 3, 4, 5, 6, 7)
val sep = separateEven(numbers)
 
sep == (List(2, 4, 6), List(1, 3, 5, 7))    // true

Literally, the partition method splits elements into two different lists, based on the boolean value of the given predicate (in this case, even or odd).  These lists are returned as a tuple, Scala’s idiom for returning multiple values from a single method.
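
Since the result is just a tuple, it can be destructured directly at the call site (a quick usage sketch):

val (evens, odds) = separateEven(List(1, 2, 3, 4, 5, 6, 7))

evens == List(2, 4, 6)     // true
odds == List(1, 3, 5, 7)   // true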

Searching

Having the capability to filter and split an entire list is all well and good, but it is perhaps even more common to need to find a specific element within a collection.  This is most often seen when dealing with sets or maps, but it can also be useful in the context of linear structures such as list and array.  A simple example might be a trivial caching mechanism for database lookups:

val cache = new HashMap[Int, String]
 
def readData(id: Int) = {
  if (cache.contains(id)) {    // gonna find her
    cache(id)
  } else {
    val conn = createConnection()
 
    val back = try {
      val stmt = conn.prepareStatement("SELECT value FROM dataTab WHERE id = ?")
      stmt.setInt(1, id)
 
      val res = stmt.executeQuery()
      if (res.next()) {
        res.getString(1)
      } else {
        null
      }
    } finally {
      conn.close()
    }
 
    cache += (id -> back)
    back
  }
}

Unlike Java, Scala’s collections API is extremely consistent in what methods correspond to what functionality.  The contains method on a Map does in fact search based on key, not value.  However, sometimes the situation calls for a solution which is not so specific.  Looking for a particular element is great (and very efficient on maps and sets), but it isn’t the most general implementation.  A more flexible form of searching would be to match based on a higher-order function (just like filter), rather than an explicit value.  This not only allows searching for a specific element, but it also provides the ability to look for a range.  More generally, the exists method makes it possible to check to see if an element of a given collection satisfies a certain property.

val nums = List(1, 2, 3, 4, 5, 6)
nums.exists((n) => (3 to 5).contains(n))

In this example, we are literally checking the list, nums, for any values which are in the mathematical range [3, 5].  The exists method calls the predicate (our function parameter – or “lambda” if you prefer) for each element in the list until it returns true, at which point the iteration is short circuited.  The predicate itself creates a new Range and checks to see if the specified value is within that range.  As it turns out, Range itself is a collection just like any other, defining the same methods that we’ve come to know and love.

There is one final variation on the “search” theme that is worth examining: find.  While it’s great to know that some element within a collection satisfies a certain property, sometimes it is even more useful to be able to ascertain what element that was.  Thus, rather than returning a Boolean, the find method returns an instance of the Option monad containing the first element which satisfies the predicate, or None if the property remains unsatisfied.  Adapting our example from above yields the following code and its associated result:

val nums = List(1, 2, 3, 4, 5, 6)
val elem = nums.find((n) => (3 to 5).contains(n))
 
elem == Some(3)   // true

Conclusion

Well, it’s been a rather short series, but hopefully still worth reading.  In truth, I skipped a great deal of detail and idiomatic beauty that can be found within the Scala collections API.  While the type definitions could certainly stand improvement in some areas, it is already without a doubt the best-designed collections framework I have ever used (in any language).  Literally, once you figure out how to best employ one collection type, you will have learned them all.  For further reading on the topic, you can always peruse the scaladoc, or alternatively just play around in the Scala interpreter.  Have fun!

Scala Collections for the Easily Bored Part 2: One at a Time

28 Jul 2008

As I hinted previously, this series is intended to delve into Scala’s extensive collections API and the many ways in which it can make your life easier.  Probably the most important operations you could ever perform on collections are those which examine each element, one at a time.  After all, what’s a more common array idiom than looping over all values?

In that vein, this article starts by looking at foreach, the imperative programmer’s bread-and-butter when it comes to types like Array and List.  But rather than just stopping there, we also will look at more powerful, higher-order operations like fold, map and the ever-mysterious: flatMap.

Iterating

As I said above, looping over every item in a collection is probably the most heavily used operation in a programmer’s repertoire.  In fact, this pattern is so common in imperative languages that Java 5 saw the introduction of a special construct at the language level, just to make things a little easier.  For example:

String[] names = { "Daniel", "Chris", "Joseph" };
for (String name : names) {
    System.out.println(name);
}

This code should be old hat to anyone coming from a Java background.  Under the surface, this code is compiled into a while-loop with an iterator and an increment operation.  The code steps through the array, assigning each successive element to name.  Most statically typed languages have a construct similar to this.  For example, C# offers the foreach statement, which is nearly identical to Java’s enhanced for-loop, but with a slightly different syntax.  However, some languages (like Ruby) idiomatically eschew loops and rely instead on higher-order methods.  Translating the above into Ruby yields the following:

names = ['Daniel', 'Chris', 'Joseph']
names.each do |name|
  puts name
end

In this case, we aren’t using a loop of any kind, but rather creating a block (Ruby’s name for a closure) which takes a single parameter and passes it to the built-in puts method.  This block is then passed as an object to the each method of class Array, which calls the block once for each element in series.  Certainly, there is a language construct which encapsulates this, but using the each method directly is considered more “Ruby”.

The same approach is taken in Scala.  Rather than define a special construct for iteration, Scala simply provides the syntactic tools needed to construct higher-order methods like Ruby’s each.  Every collection in Scala’s library defines (or inherits) a foreach method (taking after its C# ancestry) which behaves precisely like Ruby’s each.  To show how, we will translate our example once more, this time into Scala:

val names = Array("Daniel", "Chris", "Joseph")
names.foreach { name =>
  println(name)
}

Here we define an anonymous method (Scala’s name for a closure) which takes a single parameter.  As in Ruby, this closure is passed to foreach and invoked for each array element.  In this way, foreach is a so-called “higher-order” method, due to the fact that it accepts a parameter which is itself another method.  Scala’s implementation of this concept is actually a bit more general than Ruby’s, allowing us to shorten our example into the following:

val names = Array("Daniel", "Chris", "Joseph")
names.foreach(println)

This time, instead of creating an anonymous method to pass to foreach, we simply pass the println method itself.  The only criterion that foreach imposes on its parameter is that it is a method which accepts a single parameter of type String (the element type of the array).  Since we already have just such a method (println), there is no need to define another simply for encapsulation.

Unfortunately, despite its flexibility, foreach is not always the most syntactically concise way to iterate over a collection.  There are times that we just want to use a syntax which is similar to the for-loops available in other languages.  Well, fear not!  Scala’s for-comprehensions are more than capable of providing just such a syntax.  Consider the example of imperatively summing the values of a list:

val nums = List(1, 2, 3, 4, 5)
 
var sum = 0
for (n <- nums) {
  sum += n
}

At compile-time, the above is actually translated into an equivalent call to foreach, passing the body of the loop as the anonymous method.  This means that any class which correctly declares a foreach method may be used in a for-comprehension in this way, greatly reducing the syntactic overhead.

Folding

Looping is nice, but sometimes there are situations where it is necessary to somehow combine or examine every element in a collection, producing a single value as a result.  An example of this could be our previous example of summing a list.  Using foreach, we had to make use of a mutable variable (sum) and produce the result as a side effect.  This is fine for hybrid languages like Scala, but some languages actually lack mutable variables altogether.  In the previous post, I mentioned ML, which is a pure-functional language (almost).  Since pure-functional languages lack mutable state, the gods of computing needed to come up with some other way to accommodate this particular case.  The solution is called “folding”.

Folding a collection is literally the process of looking at each element in addition to a current accumulator and returning some value.  To make things more clear, let’s re-write our list summing example in a functional style:

val nums = List(1, 2, 3, 4, 5)
val sum = nums.foldLeft(0) { (total, n) =>
  total + n
}

It may seem a bit disguised, but this too is a loop, just like foreach.  For each element in the list (starting from the left), the foldLeft method will call the anonymous method we passed to it, providing both the current element, as well as the total accumulated from the previous call.  Since there was no previous call when the first element is processed, we must specify an initial value for total – in this case, 0.  Literally, the above can be flattened into the following:

val sum = 0 + 1 + 2 + 3 + 4 + 5

Of course, we would never want to hard-code a list in this fashion, but it serves as a sufficient illustration of the functionality.  I know from experience that when you first discover fold it’s difficult to see why anyone would want to use a construct so limited.  After all, doesn’t it just serve to obscure the true meaning of the code?  Well, take my word for it, fold is an almost indispensable tool…once you get to know it a little better.  Try keeping an eye out for times in your own code where a fold might be useful instead of a conventional loop.  You’ll be surprised how often it can be used to solve a problem, sometimes one not even intuitively related to accumulation.
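
As one small example of a fold solving a problem that isn’t obviously an accumulation, here is a list reversal (a sketch):

val nums = List(1, 2, 3, 4, 5)
val reversed = nums.foldLeft(List[Int]()) { (acc, n) =>
  n :: acc
}

reversed == List(5, 4, 3, 2, 1)   // true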

There’s no special language-level syntax for fold, but Scala’s powerful operator overloading mechanism has allowed the designers of the collections API to create a special operator of rather dubious readability.  To illustrate, here’s our “summing a list” example once more:

val nums = List(1, 2, 3, 4, 5)
(0 /: nums) {_+_}

Yeah, I can’t read it either.  This example is semantically equivalent to the previous fold, but its meaning is a bit obfuscated by a) the bizarre right-associative operator, and b) a mysterious cameo by Scala’s ubiquitous underscore.  While I don’t have a problem using the underscore syntax in my own code, I don’t think that the fold operator improves anything other than number of characters.  I suppose it’s a matter of taste though.

Reduce

Fold has a closely related operation in Scala called “reduce” which can be extremely helpful in merging the elements of a sequence where leading or trailing values might be a problem.  Consider the ever-popular example of transforming a list of String(s) into a single, comma-delimited value:

val names = List("Daniel", "Chris", "Joseph")
val str = names.foldLeft("") { (acc, n) =>
  acc + ", " + n
}
 
println(str)

If you compile and run this code, you will actually arrive at a result which looks like the following:

, Daniel, Chris, Joseph

This is because folding a list requires a leading value, but this means that we have an extra separator running wild.  We could try a foldRight, but this would merely move the same problem to the tail of the string.  Interestingly enough, this problem of leading/trailing separators is hardly specific to folding.  I can’t tell you how many times I ran into this issue when constructing highly-imperative query synthesis algorithms for ActiveObjects.

The easiest way to solve this problem in Scala is to simply use a reduce, rather than a fold.  As a rule, any collection which defines foldLeft will also define reduceLeft (and the corresponding right methods).  Reduce distinguishes itself from fold in that it does not require an initial value to “prime the sequence” as it were.  Rather, it starts with the very first element in the sequence and moves on to the end.  Thus, the following code will produce the desired result for our names example:

val names = List("Daniel", "Chris", "Joseph")
val str = names.reduceLeft[String] { (acc, n) =>
  acc + ", " + n
}
 
println(str)

There are of course a few small problems with this approach.  Firstly, it is not as general as a fold.  Reduce is designed primarily for the iterate/accumulate pattern, whereas fold may be applied to many problems (such as searching a list).  Also, the reduce method will throw an exception if the target collection is empty.  Finally, Scala’s type inference isn’t quite clever enough to figure out what’s going on without the explicit [String] type parameter (since our result is of type String).  Still, even with all these shortcomings, reduce can be a very powerful tool in the right hands.

Mapping

We’ve seen how fold can be an extremely useful tool for applying a computation to each element in a collection and arriving at a single result, but what if we want to apply a method to every element in a collection in-place (as it were), creating a new collection of the same type with the modified elements?  Coming from an imperative background, this probably sounds a little abstract, but like fold, map can be extremely useful in many common scenarios.  Consider the example of transforming a list of String(s) into a list of Int(s):

val strs = List("1", "2", "3", "4", "5")
val nums = strs.map { s =>
  s.toInt
}
 
nums == List(1, 2, 3, 4, 5)   // true

The final expression in this snippet is just to illustrate what really happens to the list elements when map is called.  Literally, the map method walks through each element in the list, calls the provided method and then stores the result in the corresponding index of a new list.  (List is immutable, remember?)  If you think about it, this is very similar to looping with foreach except that at each iteration we produce a value which is saved for future use.

Another common application of this technique might be to cast an entire array from one type to another.  I often make use of XMLRPC, which has the unfortunate property of stripping all type information from its composite types.  Thus, I often find myself writing code like this:

public void rpcMethod(Object[] params) {
    String[] strParams = new String[params.length];
    for (int i = 0; i < params.length; i++) {
        strParams[i] = (String) params[i];
    }
}

It’s a lot of trouble to go through, but I really don’t know any better way.  We can’t just cast the array to String[], since the array itself is not of type String[], only its elements match that type.  Java doesn’t support higher-order operations such as map, but fortunately Scala does.  We can translate the above into a functional style and gain tremendously in both readability and conciseness:

def rpcMethod(params: Array[Object]) {
  val strParams = params.map { _.asInstanceOf[String] }
}

For the sake of brevity, you’ll notice that I made use of the underscore syntax as a placeholder for the anonymous method parameter.  This syntax works remarkably well for short operations like the above, where all we need to do is take the input value and cast it to a new type.

As it turns out, mapping over a collection is a phenomenally common operation, perhaps even more so than fold.  For that reason, the creators of Scala decided that it was worth adding a special syntax sugar built into the powerful for-comprehension mechanism.  With a little bit of tweaking, we can transform our casting example into an arguably more readable form:

def rpcMethod(params: Array[Object]) {
  val strParams = for (p <- params) yield p.asInstanceOf[String]
}

At compile-time, these two forms are literally equivalent.  In some ways it is a matter of taste as to which is better.  I personally tend to favor directly calling map for simple, non-combinatorial operations like this, but to each his own.

Binding

Actually, the name “bind” comes from Haskell.  Scala’s term for this operation is “flatMap” because the operation may be viewed as a combination of the map and flatten methods.  Of all of the techniques discussed so far, this one probably has the richest theoretical implications.  Coming straight from the menacing jungles of category theory and the perplexing wasteland of monads, flatMap is both intriguing and apparently useless (at first glance anyway).

Like map, flatMap walks through every element in a collection and applies a given function, saving the value for later use.  However, unlike map, flatMap expects the return type of the specified function to be the same as the enclosing collection with an optionally different type parameter.  If we look at this in terms of our number-converting example from previously, this means that our anonymous method must not return a value of type Int, but rather of type List[Int].  Once flatMap has all of the resultant List[Int] values, it flattens them into a single list containing all of the elements from each of the inner lists.

Ok, that was utterly un-helpful.  Maybe the method signature would be more illuminating:

class List[A] {   // don't try this at home
  def flatMap[B](f: (A)=>List[B]): List[B]
}
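
Connecting the signature back to the name, here is a quick sketch of the “map, then flatten” reading of the operation on plain lists:

List(1, 2, 3) map { n => List(n, n * 10) }
// == List(List(1, 10), List(2, 20), List(3, 30))

List(1, 2, 3) flatMap { n => List(n, n * 10) }
// == List(1, 10, 2, 20, 3, 30) -- the inner lists merged into one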

Other than forcing order of evaluation, I can’t personally think of too many common cases where this sort of operation is useful.  However, one contrived example does spring to mind.  Consider once more the example of converting a list of String(s) into a list of Int(s), but this time assume that some of the String elements might not nicely convert into integer values:

val strs = List("1", "two", "3", "four", "five")
val nums = strs.flatMap { s =>
  try {
    List(s.toInt)
  } catch {
    case _ => Nil
  }
}
 
nums == List(1, 3)    // true

Recall that in Scala, everything is an expression – including try/catch blocks – therefore, everything has a value.  This code literally walks through the entire list and tries to convert each element into an integer and wrap the result in a List.  If the conversion fails (for whatever reason), an empty list is returned (Nil).  Because we return an empty list for those elements which cannot be converted, flatMap literally resolves those results out of existence, leading to a List which only contains two Int(s).  For the monadically uninclined among us, this is precisely the reason why Nil is referred to as the “zero” of the List monad.  However, that’s a topic for an entirely different series.

Conclusion

Ok, so this article was a bit longer than I really wanted to run, but there’s a lot of material to cover!  Scala’s collections framework shows how even operations steeped in mountains of theory can still prove useful in solving common problems.  Now, every time I use collections in Java (or even Ruby), I find myself reaching for many of these same methods, only to find them either unavailable or less powerful than I would like.  Scala provides a welcome refuge for all those of us who desire more powerful collection types in our daily programming.

Be with us next time for filter, forall, exists and more!