Skip to content
Print

Scala Collections for the Easily Bored Part 1: A Tale of Two Flavors

21
Jul
2008

One of the most obvious things to a Java developer first coming into Scala-land is the radically different Collections API included as part of the standard library.  For the most part, we use the same frameworks and APIs in Scala as are available in Java.  This is natural, thanks to the extremely tight integration between the two languages.  So why is this one area such a startling departure from Scala’s heritage?

The answer has to do with what the Scala language is syntactically capable of handling.  Scala isn’t just an object-oriented language, it is also highly functional.  It is only natural that such an integral part of the core libraries would reflect this fact.  Unfortunately, most developers fail to take full advantage of the power offered by the collections API.  Despite the available power, most code written using Scala’s collections tends to look a lot like Java in disguise.

I had actually planned on addressing this topic in a single article.  However, Scala’s collections are so vast and powerful (sounds like one of Roddenberry’s alien consciousness beings) that it really would overrun the limits of conventional blogging to attempt to cover it all in a single post.  Despite the fact that I’ve been creating mammoth anthologies of late, I think it’s probably better to break it into bite-sized chunks.  First up: the confusing dual nature of Scala’s collections API!

A Tale of Two Flavors

The very first thing developers notice when looking at Scala’s collections is a (seemingly) odd redundancy in the specification.  Looking under the scala.collection package, we see not one, but three separate sub-packages, each containing what seem to be reimplementations of the same types.  For example, consider the following three traits:

I don’t know about you, but this confused the heck out of me the first time I really dug into Scala’s standard API.  Actually, it gets even worse when you discover that there is also a trait (and companion object), scala.collection.Map, which is actually a super-trait of the three listed above.  Seems like Dr. Odersky discovered the magic of separated namespaces and reacted like a two-year-old on espresso.

As it turns out, there’s a very logical reason for having these separated and seemingly-redundant collections packages.  Of the three, one of them can be discounted immediately as uninteresting.  The jcl package contains collections, but they are merely wrappers for the corresponding Java collections.  This allows more efficient transmutation between the Java collections API and Scala.  It is almost never necessary to use this package directly, since a number of implicit conversions are built into Scala to make the process essentially transparent.

Of course, this still leaves the immutable and mutable packages.  This distinction traces back to some of Scala’s functional roots.  As you are likely aware, Scala supports both mutable and immutable variables, as denoted by the var and val keywords, respectively.  While there are some significant differences at compile-time, conceptually, the only distinction between these two types is the former allows reassignment whereas the latter does not.  Mutable variables are a common – and indeed, essential – feature in imperative languages (Java, C++, Ruby, etc).  For example, here’s how we would sum an array of integers in Java:

int[] numbers = ...
 
int sum = 0;
for (int n : numbers) {
    sum += n;   // reassign sum to accumulate n
}

In this case, sum is a mutable variable which accumulates the total value of all numbers in the array, numbers.  Theoretical disputes aside, this style of programming is simply impossible in certain languages.  For example, SML provides no mechanism for declaring mutable variables.  So if we want to sum the values in a list of ints, we have to do it in some other way (code provided for the curious):

fun sumList ls = foldl (op +) 0 ls

In Scala, both of these techniques are available to us.  However, despite providing for mutable state, Scala does encourage developers to avoid it.  The reasoning behind this is that mutable state has a tendency to make code more difficult to reason about, making testing much harder.  Also, as any experienced developer will tell you, mutable state kills concurrency.

Not only does Scala encourage the use of immutable variables, it also encourages the use of immutable collections.  This concept may seem a little bizarre to those of you coming from an imperative background (I know it did to me), but it actually works.  While a map container which cannot be modified probably seems a little useless, it actually provides a startling degree of freedom from concerns like concurrent locking and unintended side-effects.  With immutable collections, you are able to manipulate objects with perfect assuredness that no method call will “accidentally” alter its contents.  How many times have you cloned a collection prior to returning it?  Or how often have you dug through someone else’s code just to assure yourself that it is safe to call a given method passing an instance of List?  Immutable data structures completely solve that problem.

Naturally, immutable collections require very different patterns and idioms than those which are mutable.  My ML code from above illustrates this to a very small degree, using a fold to traverse an immutable list.  As a general rule, immutable data structures are only useful when being passed around via method calls.  A common pattern is to build up a data structure recursively, creating a new instance with one more element at each invocation:

def toSet[T](list: List[T]) = {
  def traverse(list: List[T])(set: Set[T]): Set[T] = list match {
    case hd :: tail => traverse(tail)(set + hd)   // create a new Set, adding hd
    case Nil => set
  }
 
  traverse(list)(Set[T]())
}
 
val names = List("Daniel", "Chris", "Joseph", "Renee")
val nameSet = toSet(names)

At each recursive invocation of traverse, a new Set is created, based on the contents of the old one, set, but with one additional member, hd.  At the same time, we are deconstructing an immutable List, list, selecting its first element and whatever remains at each step.  Whenever you work with immutable data structures, you will see a lot of code which looks like this.

Of course, the natural question which comes to mind is, what about performance?  If each invocation actually creates a brand new Set for every recursive call, doesn’t that require a lot of inefficient object copying and heap operations?  Well, as it turns out, this isn’t really the case.  Yes, a new instance must be created at each turn, which is is a comparatively expensive operation on the JVM, but almost nothing is being copied.  All of Scala’s immutable data structures have a property known as persistence, which means that you don’t copy the data out of the old container when creating a new, you just have the new reference the old and treat all of its contents as if they were its own.  A linked list is a good example of this, since each node of a list contains exactly one element, as well as a reference to another node.  If we think of each node as a representative of a list starting with itself and traversing to the end, then list suddenly becomes a fully persistent data structure (since the new list contains a sub-list in its entirety and additive operations require no data copying).  Rich Hickey, the creator of Clojure (a Lisp dialect running on the JVM), has an excellent presentation which explains some of the hows and whys behind this technique (as well as some other interesting topics).  Chapter 19 of Programming in Scala (Odersky, Spoon & Venners) also has a good example of a persistent immutable queue.

Conclusion

I happen to like immutable data, so most of this series uses the scala.collection.immutable package.  However, there are certainly situations where mutable data structures are the only way to go, either because of system requirements or for performance reasons.  Fortunately, Scala’s mutable collections have an almost identical interface to its immutable collections.  Thus, most of the in formation presented here is applicable to both branches of the library.

Now that we have laid a basic foundation regarding the fundamentals of Scala’s collections framework, we can move onto more interesting things.  The next installment will deal with fold and map, the bread and butter of every collection.

Comments

  1. “Fortunately, Scala’s mutable collections have an almost identical interface to its immutable collections.”

    This is one of the reasons that abstracting over a type constructor can bring great rewards without the repetition. Frankly, I’m very unimpressed with the core Scala library; oh except for scala.Either of course; that was written by a really handsome bloke :)

    I have some libraries that abstract over type constructors in Scalaz 2, but I have rewritten it all for Scalaz 3 (hoping to release this week) with some much better examples. This defers the choice of whether you wish to use mutable or immutable data types in your own code.

    See Monad, Functor, Applicative, MonadPlus, MonadEmpty, Foldable, etc.

    Tony Morris Monday, July 21, 2008 at 1:58 am
  2. Yeah, no arguments here. :-) To be honest l, I really don’t understand why Scala’s collections are so constrained. The type system can certainly handle a better design (as your work with Scalaz shows). Hopefully, the LAMP team will take some inspiration from you and fix the defects in a future release.

    With that said, it’s really not all that bad for practical purposes. Obviously, it’s annoying that methods like map work on Iterable and not the specific subtype. In normal usage, however, it isn’t really a big deal. If nothing else, even crippled Scala collections are far more powerful than their Java counterparts. :-)

    Daniel Spiewak Monday, July 21, 2008 at 11:16 am
  3. Frankly, I still fail to see the problems with Scala collections. In your example, you could have directly written:
    val nameSet = Set() ++ names
    Such idioms to convert between different types of collections are not so bad, right? You are undoubtedly aware of this, so I’m surprised you did not mention it.

    Chris de Vreeze Saturday, August 23, 2008 at 5:40 am
  4. I chose toSet() as a simple example to show how to use immutable collections in a semi-practical setting. Obviously you would use something a bit more compact (like your idiom) in the real world. :-)

    As to the problems… Take a gander at the signature for List#flatMap. By all rights, its function parameter should be constrained to return a value of type List[B]. However, this is not possible conventionally due to the fact that method parameters vary *contravariantly* and return types vary *covariantly*. This means that the only sound overriding of the flatMap method would have to define the function parameter to either return Iterable[B] (as in the superclass) or any supertype of Iterable[B]. Very annoying restriction.

    Higher kinds allow us to get around this, but Scala’s collections API was built before the introduction of higher-kind inference. I’m not entirely sure why the collections API hasn’t been rewritten to take advantage of this powerful addition to the type system, but unfortunately that’s the way it is.

    Daniel Spiewak Saturday, August 23, 2008 at 9:38 am
  5. Ok, I see what you are getting at. Btw, thanks for your blogs on Scala for Java developers. They are really helpful.

    Chris de Vreeze Saturday, August 23, 2008 at 10:33 am
  6. So… whiy is it that, according to Orderksy et al in Programming in Scala…

    To get an instance of an array you write: new Array
    But to get an instance of a Set you type: Set()

    This irregularity is truly ugly.

    jeffpk Monday, December 28, 2009 at 8:52 pm
  7. Actually, it’s not so much of an irregularity. Array is a specific implementation of Seq, whereas Set is a generic abstraction. When you say Set(), the result is an instance of some implementation, but you don’t know which one. Because Array is just an interface to direct, random-access memory, you can’t have multiple implementations.

    In practice, arrays aren’t used all that much in Scala. Most of the time, I find myself using toArray from some other Seq built in a functional manner, and then only when I have to interact with a legacy Java API which requires raw arrays.

    Daniel Spiewak Monday, December 28, 2009 at 11:24 pm
  8. Daniel,

    I’m a bit confused about 2 examples in Programming in Scala:

    var jetSet = Set (“Boeing”, “Airbus”) //immutable set
    jetSet += “Lear” // but we can still change it?

    and this as well:

    import scala.collection.mutable.Set
    val movieSet = Set(“Hitch”,”Poltegeist”)
    movieSet += “Shrek”

    What’s the difference between these?

    Ali Tuesday, February 2, 2010 at 5:43 am
  9. @Ali

    In the first example, you do NOT change the immutable Set,
    but you do assign a (reference to) new Set to variable jetSet.

    In the second example you modify the original set.

    Try without the import:
    scala> val movieSet = Set(“Hitch”, “Poltegeist”)
    movieSet: scala.collection.immutable.Set[java.lang.String] = Set(Hitch, Poltegeist)

    scala> movieSet += “Scary Movie”
    :6: error: reassignment to val
    movieSet += “Scary Movie”
    ^

    Design Pattern Friday, February 19, 2010 at 12:45 pm
  10. In Java I just stumbled over the concurrent modification exception and I wonder how this would be solved in Scala. We have classes which return collections. One thread loops over the collection and another one adds an element to the original, and bum. In Java I would make a defensive copy to avoid that.
    How to solve that in Scala?

    Haug Bürger Sunday, November 14, 2010 at 6:29 am
  11. Haug,

    Your example is a good example of why immutable collections are to be preferred over mutable ones; I’ve had the same problem in my Java code too. I suspect that the particular multiple-thread problem arises due to a design error rather than a coding error – you are trying to change something in one thread whilst you’re using it in another thread.

    Mutable objects can only be shared when the right amount of locking is used; too little locking and you get race conditions (which is detected programmatically by the iterator in the example you describe); too much locking and you get poor performance and even deadlocks.

    The java.util.concurrent classes go a long way to provide mutable data structures that can be safely shared. But they still fall short of what might be desirable in terms of simplicity, reliability and good performance all combined.

    It’s not news that immutable objects make concurrency easy – this has been known for a long time. However, the OO paradigm has struggled to make good use of this knowledge. Fortunately, Scala is introducing these Good Things to the Java world at large.

    Incidentally you don’t even need more than one thread to suffer from ConcurrentModificationExceptions. Just pass a list to a method and try to modify it. Now call this method from within an iterator loop over the same list. Bang. This is an example of aliasing going wrong, which is another OO weakness that functional programming addresses.

    I like the way Scala has a foot in both OO and FP paradigms and makes it easy to transition towards FP, bringing reliability and testability benefits.

    Rick

    Rick Beton Monday, December 13, 2010 at 5:58 pm

Post a Comment

Comments are automatically formatted. Markup are either stripped or will cause large blocks of text to be eaten, depending on the phase of the moon. Code snippets should be wrapped in <pre>...</pre> tags. Indentation within pre tags will be preserved, and most instances of "<" and ">" will work without a problem.

Please note that first-time commenters are moderated, so don't panic if your comment doesn't appear immediately.

*
*