Sep
2008
In my previous post, I introduced the concept of "persistent vectors" and walked through one implementation of the idea. When I actually pushed the post out, I was pretty happy with my code, but it seems I still have much to learn. :-) A number of very smart people replied, suggesting ways that the implementation could be cleaned up and improved. Among these intrepid commenters was David MacIver, who correctly pointed out the similarities between my persistent vector and his IntMap class (coming in Scala 2.7.2). Needless to say, my interest was piqued, so over the course of last week, I spent a fair amount of time going over his implementation as well as the implementations proposed by researchers in the past.
I also took the time to create a proper performance test suite for my vector implementation, one which could compare with other conceptually-similar implementations in a controlled and repeatable environment. The results were both interesting and instructive, so I figured I would take the time to share them here.
Essentially, the performance suite runs through six tests, each of which designed to illuminate either a strength or a weakness in the underlying implementation. These tests are run against four different classes:
Vector(my persistent implementation)ArrayBuffer(fromscala.collection.mutable)IntMap(David MacIver)Map
The addition of the last test target was more curiosity than anything else. I wanted to see just how improved was IntMap over Map for integer keys. The results turned out to be rather surprising:
| Time | Memory | ||||||||
| Vector | ArrayBuffer | IntMap | Map | Vector | ArrayBuffer | IntMap | Map | ||
| Fill Sequential | 190.51ms | 15.39ms | 37.15ms | 115.14ms | 67.11 MB | 3.93 MB | 22.29 MB | 12 MB | |
| Fill Random | 392.98ms | 2028.43ms | 128.35ms | 103.3ms | 64.97 MB | 513.59 MB | 39.89 MB | 10.94 MB | |
| Read Sequential | 28.01ms | 3.83ms | 23.21ms | 35.24ms | 6.67 MB | 0.02 KB | 0.02 KB | 3.37 MB | |
| Read Random | 92.8ms | 11.14ms | 54.81ms | 63.8ms | 8.01 MB | 0.02 KB | 0.02 KB | 2.01 MB | |
| Reverse | 0.02ms | 0.01ms | - | - | 0.09 KB | 0.04 KB | - | - | |
| Compute Length | 0.01ms | 0.01ms | 5.11ms | 0.3ms | 0.02 KB | 0.02 KB | 0.02 KB | 2.23 MB | |
As you can see, IntMap triumphed over the other immutable data structures (including Vector) at just about every turn. To understand why this is surprising, we need to spend a little time examining the theoretical properties of the two primary implementations: IntMap and Vector.
Patricia Tries
I already spent a fair bit of time explaining the concept of partitioned tries in the previous article, so I’m not going to reiterate all of that information here. In a nutshell, the implementation of Vector is based upon the concept of a trie with an extremely high branching factor where the path to each node encodes its index. Unlike List, the data structure is not fully persistent, meaning that some data copying must take place upon insert. Specifically, a new array of branches must be allocated for the specific parent node of the inserted value and so on recursively on to the root. The advantage to this partially-persistent implementation is that we can take advantage of the constant access time afforded by the use of arrays under the surface. The unfortunate truth is that fully persistent data structures do not offer constant access time (at least, none that I know of), and thus are generally unsuitable for implementing random-access vectors.
The idea for this implementation comes originally from Phil Bagwell (interestingly enough, a researcher at LAMP, Martin Odersky’s research department at EPFL) in a paper entitled "Ideal Hash Trees". His original concept though was for a more efficient hash table data structure, not necessarily with immutability as a requirement. This concept was then adapted by Rich Hickey for his language, Clojure. Finally, I expanded upon Clojure’s persistent vectors somewhat by changing their trie distribution from little- to big-endian and wrote up the result in Scala. There are some other minor differences between Hickey’s design and my own, but the data structures are essentially identical.
Like my Vector implementation, the idea for IntMap began its life as a research paper, this time by Chris Okaski and Andrew Gill. This paper is quite an interesting read if you’ve got a spare afternoon, although it does make use of SML rather than Scala as a base language. In summary, the idea was to create an efficient, persistent map structure for integer keys. Superficially, this sounds quite similar to Hickey’s modification of Bagwell’s concept, but there are many important distinctions below the surface.
At the extremely lowest level, IntMap actually makes use of a structure known as a "Patricia trie" with a fixed branching factor of two. Much like Vector, IntMap encodes the key (index) of the data node within its path. This encoding is based on the individual bits of the index. However, unlike Vector, the ordering is little-endian. Also, to avoid needlessly growing trees to absurd depths, linear common sub-keys are merged into a single prefix node. This is what differentiates Patricia tries. To illustrate, if our branching factor were 10 and we were to store at indexes 1234 and 2234, the common prefix "234" would be represented in a single node, rather than three separate nodes trivially linked to one-another.
This use of a low branching factor in the trie is extremely useful when performing insertions. Specifically, more of the trie structure is preserved untouched from one modification to another. Literally, IntMap is more persistent than Vector. While this is great for writes, it does make read times a little less efficient. Specifically, IntMap reads are worst-case O( log2(k) ), where k is the index in question. For random data input, the average case is reduced somewhat by the prefix collapsing, but this should not be too significant.
By contrast, Vector uses an extremely high branching factor (by default) and so offers read efficiency which is O( logb(k) ), where k is the index and b is the branching factor. Due to the practical limitations imposed on integer length, this translates into an upper-bound of O(7), which is (for all intents and purposes) constant. Unfortunately, this analysis does not seem to be born-out by the performance data.
Possible Explanation
The only answer I can think of to explain the disparity between IntMap and Vector (both in time and in space utilization) is the use of a List[Int] in Vector to find the target node in the data structure. This List is required primarily because I wanted the data distribution in the trie to be optimized for sequential access, therefore requiring the trie encoding to be big-endian on the index rather than little-endian. The unfortunate truth is there’s no clean mathematical method (that I know of) which would allow the deconstruction of a number based on its most significant value in an arbitrary base. In fact, the implementation of computePath (as suggested by Jules) actually cheats and makes use of the fact that persistent List(s) are constructed from the tail-end:
@inline def computePath(total: Int, base: Int) = { if (total < 0) { throw new IndexOutOfBoundsException(total.toString) } else { var back: List[Int] = Nil var num = total do { back = (num % base) :: back num /= base } while (num > 0) back } } |
As efficient as this method is on most modern processors, it can never be as fast as a simple bit-masking operation. Not only that, but it requires the creation of massive numbers of small, immutable objects (cons cells). I believe that it is this instantiation overhead which is eating up the extra memory on reads and killing the overall performance.
Unfortunately, I can’t seem to conceive a better way of doing big-endian data distribution. So are there any clever mathy people out there who have a brilliant way of deconstructing the index head-first rather than from the tail end? If I could do that, then I could remove the List entirely from the implementation and rely instead on in-place calculations. Maybe then I could catch up with David’s blisteringly fast IntMap.
- Vector.scala (modified since last time)
- VectorPerfTest.scala


Comments
It’d be interesting to also compare Scalax’s scalax.data.FastArrayBuffer
Great! Just what I need: more super-fast data structures to make me feel depressed about myself.
Seriously though, I think FastArrayBuffer is a mutable data structure, which puts it in an entirely different league. I threw in ArrayBuffer just to be thorough, there’s really no way to compare with it when each modification requires an entirely new instance. Read times perhaps, but not writes.
So, I just ran some benchmarks against FastArrayBuffer. I don’t think you have to worry, Daniel.
http://www.nabble.com/NotParticularlyFastArrayBuffer-td19251338.html
Incidentally, the “in an arbitrary base” thing is probably a non-trivial part of what’s killing your performance. Powers of two will make life much happier.
Incidentally, I really really hate that you have graphical smileys turned on in the comments.
I know you’ve said what you have built is essentially the same as Clojure’s persistent vector, but it seems not to be, at least in some critical details. Here are the numbers I get for a run from Clojure on my machine:
Fill sequential: 35.97 msecs
Fill random: 248.78 msecs
Read sequential: 4.21 msecs
Read random: 20.48 msecs
Compute length: 0.05 msecs
Reverse: 0.05 msecs
I think if you keep working on the details you should get similar results, i.e. should win on reads as you’d expected. Also note Clojure’s vectors have an optimization for fast append which postpones tree updates, giving it a good score there too.
@David
I wouldn’t mind restricting to powers of two, since 32 is a really nice, round base that satisfies just about everybody. However, I’m not seeing how that will improve things too much since I still have to find a big-endian path. I wonder if there is a funky-clever way to mask off the *top* five bits…
Oh, yeah I’ve been bitten by the smileys too, especially when ending parentheses-enclosed series of numbers. I wonder if I’ll just have to hack into WP to turn off the smileys just for the comments.
@Rich
Actually, I ran all of the tests in this article on battery power, since I didn’t have time to scramble around for a power supply.
When I run them on AC, the results are almost uniformly twice as good as what I posted (well, except for the memory of course).
I agree that there are some important distinctions between the Clojure implementation and mine. Clojure’s seems to be optimized for wide distribution of keys (as would be useful for a hash table). Clojure’s vector also uses bit twiddling to compute the path, which as David pointed out is much more efficient.
I’m curious as to how you can optimize for super fast append without compromising subsequent reads? Obviously, I can just crack open your sources and try to figure it out, but as you know, Clojure’s standard library isn’t exactly the most readable Java.
@Rich
One more thing: were your benchmarks run directly against the Java implementation or through Clojure itself? I suppose a fair test would include the host language, but regardless of how fast the Clojure interpreter may be, it wouldn’t quite be as fast as raw Java. (is it?)
@Daniel
The times I gave were from Clojure, which, BTW, is not an interpreter but a compiler, and can be as fast as Java.
I think there is a fair bit of mixing of terminology in these posts, as vector and map are getting merged in both name and function. While a vector may be considered a map, a map is not a vector in the normal sense that a vector of size n has keys of 0 to n-1. Clojure has both vectors and maps, and both are based on bitmapped hash tries. Clojure’s vectors are dense (and big-endian!) - they are not arbitrary/sparse integer maps. Thus the resulting tree is full, except at the tail. The append trick is to keep the tail out of the tree until it is a full node, so the basic data structure is tree+tail. Having a separate tail adds a simple and fast compare and mask to read, but means that only 1/32 of the appends are tree operations.
Clojure’s maps are sparse, hashed, and little-endian, and seem to be what you may have been thinking of.
> Clojure’s maps are sparse, hashed, and little-endian, and
> seem to be what you may have been thinking of.
Ah ha! I had been under the (apparently mistaken) impression that Clojure’s vectors were just a facade on its maps. I am aware that the properties of the two are quite different, but one *can* be implemented on top of the other with a fair amount of efficiency (which is what I thought you had done).
It’s quite impressive that you managed to get a dense big-endian bitmapping for the vectors. I was under the impression that you only used the least-significant five bits of the index recursively until an empty node was found. This concerned me since it would mean that the locality of reference would be quite bad. Additionally, such a design would (I should think) degrade the efficiency of lookups. It’s great for maps obviously, but vectors should be a bit more optimized. As you say though, the design is a bit different.
Do you have any further documentation on the internal semantics of Clojure’s vectors and maps? As I think I mentioned in my previous article, I was just going off of your presentation to the Mass Developers group, which wasn’t exactly technically comprehensive in terms of the data structures.
Ok, I’ve spent a few hours wading through PersistentVector.java. I think I understand how it all works now. It’s insanely efficient, but to be brutally honest, horribly convoluted.
It’s not a very clean implementation, but it’s much faster than my clumsy attempts.
Since we all *love* comparative benchmarks, here are the latest results:
================================================================================ Fill 100000 Sequential Indexes -------------------------------------------------------------------------------- Vector Time: 101.662697ms BitVector Time: 11.374888ms Vector Memory: 68720.491211 KB BitVector Memory: 14049.398438 KB Time Difference: 90.287809ms Percentage Difference: 88.811149% Memory Difference: 54671.092773 KB ================================================================================ ================================================================================ Read 100000 Sequential Indexes -------------------------------------------------------------------------------- Vector Time: 14.773393ms BitVector Time: 2.675172ms Vector Memory: 6736.846680 KB BitVector Memory: 0.023438 KB Time Difference: 12.098221ms Percentage Difference: 81.891959% Memory Difference: 6736.823242 KB ================================================================================ ================================================================================ Read 100000 Random Indexes -------------------------------------------------------------------------------- Vector Time: 62.755825ms BitVector Time: 32.797443ms Vector Memory: 8471.927734 KB BitVector Memory: 0.023438 KB Time Difference: 29.958382ms Percentage Difference: 47.738010% Memory Difference: 8471.904297 KB ================================================================================ ================================================================================ Reverse of Length 100000 -------------------------------------------------------------------------------- Vector Time: 0.013947ms BitVector Time: 0.010071ms Vector Memory: 0.093750 KB BitVector Memory: 0.039063 KB Time Difference: 0.003876ms Percentage Difference: 27.790923% Memory Difference: 0.054688 KB ================================================================================ ================================================================================ Compute Length (100000) -------------------------------------------------------------------------------- Vector Time: 0.008080ms BitVector Time: 0.007193ms Vector Memory: 0.023438 KB BitVector Memory: 0.023438 KB Time Difference: 0.000887ms Percentage Difference: 10.977723% Memory Difference: 0.000000 KB ================================================================================The results pretty-much speak for themselves. Clojure’s vector implementation is really phenomenally fast, beating out both my Vector as well as David’s IntMap (not included in the above results). I bow to your superior hackery, Rich.
The implementation is all here: http://www.codecommit.com/blog/misc/more-persistent-vectors-performance-analysis/BitVector.scala
I’ll admit that I was initially confused by the BitVector class name, which I expected would be an array-like structure for storing boolean values.
Thanks for this post and the follow-up, they are very enlightening.
Post a Comment