
More Persistent Vectors: Performance Analysis

1 Sep 2008

In my previous post, I introduced the concept of "persistent vectors" and walked through one implementation of the idea.  When I actually pushed the post out, I was pretty happy with my code, but it seems I still have much to learn.  :-)   A number of very smart people replied, suggesting ways that the implementation could be cleaned up and improved.  Among these intrepid commenters was David MacIver, who correctly pointed out the similarities between my persistent vector and his IntMap class (coming in Scala 2.7.2).  Needless to say, my interest was piqued, so over the course of last week, I spent a fair amount of time going over his implementation as well as the implementations proposed by researchers in the past.

I also took the time to create a proper performance test suite for my vector implementation, one which could compare with other conceptually-similar implementations in a controlled and repeatable environment.  The results were both interesting and instructive, so I figured I would take the time to share them here.

Essentially, the performance suite runs through six tests, each of which is designed to illuminate either a strength or a weakness in the underlying implementation.  These tests are run against four different classes:

  • Vector (my persistent implementation)
  • ArrayBuffer (from scala.collection.mutable)
  • IntMap (David MacIver)
  • Map

The addition of the last test target was more out of curiosity than anything else.  I wanted to see just how much of an improvement IntMap is over Map for integer keys.  The results turned out to be rather surprising:

                 |                  Time                     |                  Memory
                 | Vector    ArrayBuffer  IntMap    Map      | Vector    ArrayBuffer  IntMap    Map
Fill Sequential  | 190.51ms  15.39ms      37.15ms   115.14ms | 67.11 MB  3.93 MB      22.29 MB  12 MB
Fill Random      | 392.98ms  2028.43ms    128.35ms  103.3ms  | 64.97 MB  513.59 MB    39.89 MB  10.94 MB
Read Sequential  | 28.01ms   3.83ms       23.21ms   35.24ms  | 6.67 MB   0.02 KB      0.02 KB   3.37 MB
Read Random      | 92.8ms    11.14ms      54.81ms   63.8ms   | 8.01 MB   0.02 KB      0.02 KB   2.01 MB
Reverse          | 0.02ms    0.01ms       -         -        | 0.09 KB   0.04 KB      -         -
Compute Length   | 0.01ms    0.01ms       5.11ms    0.3ms    | 0.02 KB   0.02 KB      0.02 KB   2.23 MB

As you can see, IntMap triumphed over the other immutable data structures (including Vector) at just about every turn.  To understand why this is surprising, we need to spend a little time examining the theoretical properties of the two primary implementations: IntMap and Vector.

Patricia Tries

I already spent a fair bit of time explaining the concept of partitioned tries in the previous article, so I’m not going to reiterate all of that information here.  In a nutshell, the implementation of Vector is based upon the concept of a trie with an extremely high branching factor, where the path to each node encodes its index.  Unlike List, the data structure is not fully persistent, meaning that some data copying must take place upon insert.  Specifically, a new array of branches must be allocated for the parent node of the inserted value, and so on recursively up to the root.  The advantage of this partially-persistent implementation is that we can take advantage of the constant access time afforded by the use of arrays under the surface.  The unfortunate truth is that fully persistent data structures do not offer constant access time (at least, none that I know of), and thus are generally unsuitable for implementing random-access vectors.
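To make the path-copying mechanics concrete, here is a minimal sketch of the technique.  The Node, Branch, and Leaf types (and the update function) are hypothetical illustrations, not the actual Vector internals: updating one index copies a single small array per level along the path, while every sibling subtree is shared with the old version.

sealed trait Node
case class Branch(children: Array[Node]) extends Node
case class Leaf(value: Any) extends Node

// `path` holds the branch index at each level, root-first.  Only the
// nodes along the path are copied; all sibling subtrees are shared.
def update(node: Node, path: List[Int], value: Any): Node = (node, path) match {
  case (Branch(children), i :: rest) =>
    val fresh = children.clone()                 // copy one array per level
    fresh(i) = update(children(i), rest, value)
    Branch(fresh)
  case (_, Nil) =>
    Leaf(value)                                  // reached the target position
  case _ =>
    throw new IllegalStateException("path deeper than trie")
}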

The idea for this implementation comes originally from Phil Bagwell (interestingly enough, a researcher at LAMP, Martin Odersky’s research department at EPFL) in a paper entitled "Ideal Hash Trees".  His original concept, though, was for a more efficient hash table data structure, not necessarily with immutability as a requirement.  This concept was then adapted by Rich Hickey for his language, Clojure.  Finally, I expanded upon Clojure’s persistent vectors somewhat by changing their trie distribution from little- to big-endian and wrote up the result in Scala.  There are some other minor differences between Hickey’s design and my own, but the data structures are essentially identical.

Like my Vector implementation, the idea for IntMap began its life as a research paper, this time by Chris Okasaki and Andrew Gill.  This paper is quite an interesting read if you’ve got a spare afternoon, although it does make use of SML rather than Scala as a base language.  In summary, the idea was to create an efficient, persistent map structure for integer keys.  Superficially, this sounds quite similar to Hickey’s modification of Bagwell’s concept, but there are many important distinctions below the surface.

At the lowest level, IntMap actually makes use of a structure known as a "Patricia trie" with a fixed branching factor of two.  Much like Vector, IntMap encodes the key (index) of each data node within its path, based on the individual bits of the index.  However, unlike Vector, the ordering is little-endian.  Also, to avoid needlessly growing trees to absurd depths, linear runs of common sub-keys are merged into a single prefix node; this prefix collapsing is what differentiates Patricia tries from plain binary tries.  To illustrate: if our branching factor were 10 and we were to store at indexes 1234 and 2234, the shared digits "234" (which come first in little-endian order) would be represented by a single node, rather than three separate nodes trivially linked to one another.
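A rough sketch of the node shapes in such a trie, loosely after Okasaki and Gill’s formulation, may help here.  The names (PatTrie, Lf, Br, matchesPrefix) are hypothetical illustrations, not IntMap’s actual internals:

sealed trait PatTrie[+A]
case object Empty extends PatTrie[Nothing]
case class Lf[+A](key: Int, value: A) extends PatTrie[A]
// `prefix` holds the low-order bits shared by every key below this node;
// `mask` marks the single bit on which the two subtrees differ.  Collapsing
// the shared bits into one node is the Patricia-specific trick.
case class Br[+A](prefix: Int, mask: Int, left: PatTrie[A], right: PatTrie[A]) extends PatTrie[A]

// the bits of k below the branching bit must equal the stored prefix
def matchesPrefix(k: Int, prefix: Int, mask: Int): Boolean =
  (k & (mask - 1)) == prefix

// lookup inspects one bit per branch node: worst-case O(log2 k) bit tests
def lookup[A](t: PatTrie[A], k: Int): Option[A] = t match {
  case Empty      => None
  case Lf(key, v) => if (key == k) Some(v) else None
  case Br(prefix, mask, left, right) =>
    if (!matchesPrefix(k, prefix, mask)) None
    else if ((k & mask) == 0) lookup(left, k)
    else lookup(right, k)
}

Note how a whole run of shared low-order bits costs a single Br node rather than one node per bit.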

This use of a low branching factor in the trie is extremely useful when performing insertions.  Specifically, more of the trie structure is preserved untouched from one modification to the next; in that sense, IntMap is more persistent than Vector.  While this is great for writes, it does make read times a little less efficient.  Specifically, IntMap reads are worst-case O(log2 k), where k is the index in question.  For random data input, the average cost is reduced somewhat by the prefix collapsing, but this should not be too significant.

By contrast, Vector uses an extremely high branching factor (32 by default) and so offers read efficiency of O(logb k), where k is the index and b is the branching factor.  Due to the practical limitations imposed on integer length, a 32-bit index consumed five bits at a time yields a depth of at most ceil(32/5) = 7 levels, which is (for all intents and purposes) constant.  Unfortunately, this analysis does not seem to be borne out by the performance data.

Possible Explanation

The only answer I can think of to explain the disparity between IntMap and Vector (both in time and in space utilization) is the use of a List[Int] in Vector to find the target node in the data structure.  This List is required primarily because I wanted the data distribution in the trie to be optimized for sequential access, therefore requiring the trie encoding to be big-endian on the index rather than little-endian.  The unfortunate truth is that there’s no clean mathematical method (that I know of) which would allow a number to be deconstructed starting from its most significant digit in an arbitrary base.  In fact, the implementation of computePath (as suggested by Jules) actually cheats and makes use of the fact that persistent List(s) are constructed from the tail end:

@inline
def computePath(total: Int, base: Int) = {
  if (total < 0) {
    throw new IndexOutOfBoundsException(total.toString)
  } else {
    var back: List[Int] = Nil
    var num = total
 
    // peel digits off the least-significant end; because cons prepends,
    // the finished list reads most-significant-first (big-endian)
    do {
      back = (num % base) :: back
      num /= base
    } while (num > 0)
 
    back
  }
}

As efficient as this method is on most modern processors, it can never be as fast as a simple bit-masking operation.  Not only that, but it requires the creation of massive numbers of small, immutable objects (cons cells).  I believe that it is this instantiation overhead which is eating up the extra memory on reads and killing the overall performance.

Unfortunately, I can’t seem to conceive of a better way of doing big-endian data distribution.  So are there any clever mathy people out there who have a brilliant way of deconstructing the index head-first rather than from the tail end?  If I could do that, then I could remove the List entirely from the implementation and rely instead on in-place calculations.  Maybe then I could catch up with David’s blisteringly fast IntMap.  :-)
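For what it’s worth, if the base is restricted to a power of two (a possibility raised more than once in the comments below), head-first decomposition does reduce to plain shifts and masks.  Here is a minimal sketch under that assumption; foreachDigit is a hypothetical helper, not part of the implementation above:

// A minimal sketch, assuming the branching factor is a power of two
// (2^bitsPerLevel); digits come out most-significant-first, so no
// intermediate List is needed.
def foreachDigit(index: Int, bitsPerLevel: Int)(f: Int => Unit): Unit = {
  val mask = (1 << bitsPerLevel) - 1

  // find the shift of the most significant populated digit
  // (the bound check keeps the shift amount under 32 bits)
  var shift = 0
  while (shift + bitsPerLevel < 32 && (index >>> (shift + bitsPerLevel)) != 0)
    shift += bitsPerLevel

  // walk from the most significant digit down to the least
  while (shift >= 0) {
    f((index >>> shift) & mask)
    shift -= bitsPerLevel
  }
}

// e.g. foreachDigit(1000, 5)(print) visits 31, then 8 (31 * 32 + 8 == 1000)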

Comments

  1. It’d be interesting to also compare Scalax’s scalax.data.FastArrayBuffer

    Jorge Ortiz Monday, September 1, 2008 at 12:58 am
  2. Great! Just what I need: more super-fast data structures to make me feel depressed about myself. :-)

    Seriously though, I think FastArrayBuffer is a mutable data structure, which puts it in an entirely different league. I threw in ArrayBuffer just to be thorough; there’s really no way to compare with it when each modification requires an entirely new instance. Read times perhaps, but not writes.

    Daniel Spiewak Monday, September 1, 2008 at 1:04 am
  3. So, I just ran some benchmarks against FastArrayBuffer. I don’t think you have to worry, Daniel. :-)

    http://www.nabble.com/NotParticularlyFastArrayBuffer-td19251338.html

    Incidentally, the “in an arbitrary base” thing is probably a non-trivial part of what’s killing your performance. Powers of two will make life much happier. :-)

    David R. MacIver Monday, September 1, 2008 at 2:22 am
  4. Incidentally, I really really hate that you have graphical smileys turned on in the comments. :-(

    David R. MacIver Monday, September 1, 2008 at 2:22 am
  5. I know you’ve said what you have built is essentially the same as Clojure’s persistent vector, but it seems not to be, at least in some critical details. Here are the numbers I get for a run from Clojure on my machine:

    Fill sequential: 35.97 msecs
    Fill random: 248.78 msecs
    Read sequential: 4.21 msecs
    Read random: 20.48 msecs
    Compute length: 0.05 msecs
    Reverse: 0.05 msecs

    I think if you keep working on the details you should get similar results, i.e. should win on reads as you’d expected. Also note Clojure’s vectors have an optimization for fast append which postpones tree updates, giving it a good score there too.

    Rich Hickey Monday, September 1, 2008 at 7:15 am
  6. @David

    I wouldn’t mind restricting to powers of two, since 32 is a really nice, round base that satisfies just about everybody. However, I’m not seeing how that will improve things too much since I still have to find a big-endian path. I wonder if there is a funky-clever way to mask off the *top* five bits…

    Oh, yeah I’ve been bitten by the smileys too, especially when ending parentheses-enclosed series of numbers. I wonder if I’ll just have to hack into WP to turn off the smileys just for the comments.

    @Rich

    Actually, I ran all of the tests in this article on battery power, since I didn’t have time to scramble around for a power supply. :-) When I run them on AC, the results are almost uniformly twice as good as what I posted (well, except for the memory of course).

    I agree that there are some important distinctions between the Clojure implementation and mine. Clojure’s seems to be optimized for wide distribution of keys (as would be useful for a hash table). Clojure’s vector also uses bit twiddling to compute the path, which as David pointed out is much more efficient.

    I’m curious as to how you can optimize for super fast append without compromising subsequent reads? Obviously, I can just crack open your sources and try to figure it out, but as you know, Clojure’s standard library isn’t exactly the most readable Java. :-)

    Daniel Spiewak Monday, September 1, 2008 at 8:08 am
  7. @Rich

    One more thing: were your benchmarks run directly against the Java implementation or through Clojure itself? I suppose a fair test would include the host language, but regardless of how fast the Clojure interpreter may be, it wouldn’t quite be as fast as raw Java. (is it?)

    Daniel Spiewak Monday, September 1, 2008 at 8:10 am
  8. @Daniel

    The times I gave were from Clojure, which, BTW, is not an interpreter but a compiler, and can be as fast as Java.

    I think there is a fair bit of mixing of terminology in these posts, as vector and map are getting merged in both name and function. While a vector may be considered a map, a map is not a vector in the normal sense that a vector of size n has keys of 0 to n-1. Clojure has both vectors and maps, and both are based on bitmapped hash tries. Clojure’s vectors are dense (and big-endian!) – they are not arbitrary/sparse integer maps. Thus the resulting tree is full, except at the tail. The append trick is to keep the tail out of the tree until it is a full node, so the basic data structure is tree+tail. Having a separate tail adds a simple and fast compare and mask to read, but means that only 1/32 of the appends are tree operations.

    Clojure’s maps are sparse, hashed, and little-endian, and seem to be what you may have been thinking of.

    Rich Hickey Monday, September 1, 2008 at 9:30 am
  9. > Clojure’s maps are sparse, hashed, and little-endian, and
    > seem to be what you may have been thinking of.

    Ah ha! I had been under the (apparently mistaken) impression that Clojure’s vectors were just a facade on its maps. I am aware that the properties of the two are quite different, but one *can* be implemented on top of the other with a fair amount of efficiency (which is what I thought you had done).

    It’s quite impressive that you managed to get a dense big-endian bitmapping for the vectors. I was under the impression that you only used the least-significant five bits of the index recursively until an empty node was found. This concerned me since it would mean that the locality of reference would be quite bad. Additionally, such a design would (I should think) degrade the efficiency of lookups. It’s great for maps obviously, but vectors should be a bit more optimized. As you say though, the design is a bit different.

    Do you have any further documentation on the internal semantics of Clojure’s vectors and maps? As I think I mentioned in my previous article, I was just going off of your presentation to the Mass Developers group, which wasn’t exactly technically comprehensive in terms of the data structures. :-)

    Daniel Spiewak Monday, September 1, 2008 at 9:53 am
  10. Ok, I’ve spent a few hours wading through PersistentVector.java. I think I understand how it all works now. It’s insanely efficient, but to be brutally honest, horribly convoluted. :-) It’s not a very clean implementation, but it’s much faster than my clumsy attempts.

    Since we all *love* comparative benchmarks, here are the latest results:

    ================================================================================
      Fill 100000 Sequential Indexes
    --------------------------------------------------------------------------------
                Vector Time: 101.662697ms
             BitVector Time: 11.374888ms
              Vector Memory: 68720.491211 KB
           BitVector Memory: 14049.398438 KB
    
            Time Difference: 90.287809ms
      Percentage Difference: 88.811149%
    
          Memory Difference: 54671.092773 KB
    ================================================================================
    
    ================================================================================
      Read 100000 Sequential Indexes
    --------------------------------------------------------------------------------
                Vector Time: 14.773393ms
             BitVector Time: 2.675172ms
              Vector Memory: 6736.846680 KB
           BitVector Memory: 0.023438 KB
    
            Time Difference: 12.098221ms
      Percentage Difference: 81.891959%
    
          Memory Difference: 6736.823242 KB
    ================================================================================
    
    ================================================================================
      Read 100000 Random Indexes
    --------------------------------------------------------------------------------
                Vector Time: 62.755825ms
             BitVector Time: 32.797443ms
              Vector Memory: 8471.927734 KB
           BitVector Memory: 0.023438 KB
    
            Time Difference: 29.958382ms
      Percentage Difference: 47.738010%
    
          Memory Difference: 8471.904297 KB
    ================================================================================
    
    ================================================================================
      Reverse of Length 100000
    --------------------------------------------------------------------------------
                Vector Time: 0.013947ms
             BitVector Time: 0.010071ms
              Vector Memory: 0.093750 KB
           BitVector Memory: 0.039063 KB
    
            Time Difference: 0.003876ms
      Percentage Difference: 27.790923%
    
          Memory Difference: 0.054688 KB
    ================================================================================
    
    ================================================================================
      Compute Length (100000)
    --------------------------------------------------------------------------------
                Vector Time: 0.008080ms
             BitVector Time: 0.007193ms
              Vector Memory: 0.023438 KB
           BitVector Memory: 0.023438 KB
    
            Time Difference: 0.000887ms
      Percentage Difference: 10.977723%
    
          Memory Difference: 0.000000 KB
    ================================================================================

    The results pretty much speak for themselves. Clojure’s vector implementation is really phenomenally fast, beating out both my Vector as well as David’s IntMap (not included in the above results). I bow to your superior hackery, Rich.

    The implementation is all here: http://www.codecommit.com/blog/misc/more-persistent-vectors-performance-analysis/BitVector.scala

    Daniel Spiewak Monday, September 1, 2008 at 5:04 pm
  11. I’ll admit that I was initially confused by the BitVector class name, which I expected would be an array-like structure for storing boolean values.

    Thanks for this post and the follow-up, they are very enlightening.

    James Gates Tuesday, September 9, 2008 at 4:25 am
  12. Hello, I don’t know if this is still active, but I have read it and improved the algorithm for computing the path. The algorithm is faster, but it has some drawbacks:

    1. It only works with bases that are powers of 2; in the case of 32 this would be 2^5.
    2. The second argument to the function has to be the exponent x of the 2^x which represents the base.
    3. The work with the values has to be done in the computePath method. (I have made a comment where this should happen.)

    @inline
    def computePath(total: Int, baseExp: Int) = {
      if (total < 0) {
        throw new IndexOutOfBoundsException(total.toString)
      } else {
        var count = 1
        var shift = -baseExp
        var reminder = total

        // climb until count overshoots the index; count == 0 guards
        // against overflow of the repeated left shift
        while (count <= reminder && count != 0) {
          count <<= baseExp
          shift += baseExp
        }
        if (shift < 0) shift = 0   // emit a single 0 digit when total == 0

        // peel digits off most-significant-first
        while (shift >= 0) {
          val value = reminder >> shift
          reminder -= value << shift
          shift -= baseExp
          //do something with value
        }
      }
    }

    For a base of 32 you would call computePath(index, 5).

    Markus Knecht Wednesday, May 19, 2010 at 6:58 am
  13. OK, I think it’s because the smaller-than and greater-than symbols were eaten, so here is the code again. Sorry about all the comments.

    @inline
    def myComputePath(total: Int, baseExp: Int) = {
      if (total < 0) {
        throw new IndexOutOfBoundsException(total.toString)
      } else {
        var count = 1
        var shift = -baseExp
        var reminder = total

        while (count <= reminder && count != 0) {
          count <<= baseExp
          shift += baseExp
        }
        if (shift < 0) shift = 0

        while (shift >= 0) {
          val value = reminder >> shift
          reminder -= value << shift
          shift -= baseExp
          //do something with value
        }
      }
    }

    Markus Knecht Wednesday, May 19, 2010 at 7:09 am
