New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFE: Replace immutable.Vector with a faster implementation (src included) #3724
Comments
Imported From: https://issues.scala-lang.org/browse/SI-3724?orig=1
|
@djspiewak said: |
Antony Stubbs (astubbs) said: |
@djspiewak said:
I've thought about doing that, but I'm leery since the overhead is different. I have seen numbers comparing Clojure's PersistentVector with ArrayList, and those tend to be about on the same order as my numbers comparing my Vector port with ArrayBuffer. Another thing to consider is that my port of Clojure's PersistentVector actually includes some optimizations which Clojure is missing. They were all Rich Hickey's idea, he just hasn't had time to implement them himself yet. |
Michel Salim (hircus) said: It retains the BSD license and source attribution just because I wanted to play it safe The Clojure source is EPL-licensed -- do you have explicit permission from Rich Hickey to have the code under the BSD license? |
@djspiewak said:
Clojure may be EPL licensed, but the data structures are under BSD. I have explicit confirmation of that via email, but you can also see it if you look in the headers of the source for PersistentVector et al. |
@TiarkRompf said: While the trie structure is similar, there are some differences in how the implementation optimizes specific access patterns. Clojure chose a fairly simple model to make access to the right end faster: the last 32-wide block is kept outside the tree so that it can be accessed in constant time, without going through multiple tree levels. To add an element (on the right) to a Clojure Vector, only this max. 32-wide array has to be copied. We took this model a step further. In Scala, Vectors have a notion of 'focus', which identifies a single 32-wide block that is accessible in constant time. Every update operation (be it at the left end, the right end, or any other position), will put the target block in focus. Updates to a block 'in focus' will copy only that particular 32-wide block (Clojure's implementation has to copy all blocks on the path to the root for positions that are not in the rightmost block). Moreover, all the tree nodes from the root to the block in focus are kept in a 'display', i.e. a constant-time stack. Moving the focus to an adjacent 32-block incurs only one indirection through the display (possibly copying the node one level up). Indexed reads also use the display to minimize the number of indirections. The design decision behind this model is to optimize for spacio-temporal access locality. Sequentially rewriting parts of a Vector (starting from an arbitrary position) should be fast (Clojure's optimization for growing by appending to the right is a special case). Reading elements close to the last update should be fast. More generally, one central goal of the Vector design was to build the most versatile general purpose collection we could. This means that we tried to make as many operations as possible fast, but not optimize the hell out of a single one at the cost of others. As a corollary, we ruled out any model which is asymmetric regarding its left and right ends (we already have Lists) or only geared towards appending (we have Builders for that). Some operations we considered important are take, drop, takeRight, dropRight, splitAt etc (none of them are optimized in Clojure's or your implementation). Even more important is good support for Iterators and Builders because these are central to many other general collection operations. Your benchmarks show your implementation is 3 to 4 times slower for reverse, which is the only Iterator/Builder based operation you provide measurements for. My intuition is that others will show similar numbers. Without evidence that an implementation performs better on all Seq operations (well, let's say most) we cannot seriously consider it as a contender for replacing the current one. What your implementation shows, however, is that indexed access can be done faster in some cases. This is not completely surprising (it's, well, a 2x speedup in a micro benchmark), so the question is whether the results can be generalized. From my perspective, indexed access is an important operation, but for me it is not top priority. The cost of indexing is inherently log n so we can only get so far. A common use for indexing would be iteration, but iteration can in fact be done much faster. So I believe indexing is really only for random access. Now if a program's bottleneck is indeed random access performance, a log n collection will never do, and a speedup of 2 won't help. In that case, an ArrayBuffer or plain Array is the appropriate choice. That said, it is worth to look at some details. In essence, your implementation seems to do two things differently:
In your benchmarks, this doesn't show up since your call sites of apply are all monomorphic. You also use very aggressive manual inlining. That's ok for implementing a small number of operations. But if you extend it to more operations, I could imagine it getting unwieldy. After all, somebody has to maintain the code. Another possible concern is that HotSpot is less likely to compile large methods (there might also be an instruction cache penalty). In conclusion, we appreciate the effort you undertook in porting Clojure's Vector. Exploring different paths is always a valuable thing. As I have explained, we cannot accept it as a replacement in its current form. However, if you would like to help improve scala.collection.immutable.Vector, you are more than welcome. The current implementation is certainly not etched in stone. There are several avenues one could try: 1) Use typed Arrays, as you did, and see how that performs. 2) Instead of keeping a full 6-element display, keep only the bottom level (like Clojure, but with a variable focus position). 3) The current implementation has a large memory overhead for very small Vectors. Can we reduce that footprint? Obviously, we can build special subclasses for Vectors of 1,2,3,... elements, but that means we're again depending on monomorphic call sites. Another strategy would be turning the 6 display pointers into AnyRefs and making them hold actual data elements for Vectors of size <= 6. Again, this has to be carefully weighed against the cost of casting them to arrays for larger Vectors. Cheers,
|
@djspiewak said:
That would be me spreading that rumor. :-) While Scala's Vector is a tree, it doesn't push the data to the leaves, so you can't really say that it's doing the same thing that Clojure's tries do. I'll spend some more time with the implementation, but it really does look like a skip-list to me. Maybe it's just the fact that I'm looking at the six display pointers sideways, but that's what it seems like. In any case, the difference between a skip-list and a tree is really just in whether access is optimized for level tips or branch traversal, and I wasn't able to grok the implementation sufficiently to see which holds true. As you mentioned, the focus optimization changes which locality is fast to reference, so the difference between something which is conceptually a skip-list and something which is conceptually a tree is basically nil.
General sequence operations would certainly be slower than Scala's Vector as you have described the optimizations (which are certainly interesting). I absolutely disagree with the decision to make a general purpose collection though. What's done is done, but it seems that if you want fast prepend, just use a List. In my experience, you usually know the shape of your problem ahead of time, so picking out a specialized collection isn't an issue. Unfortunately, because you chose to optimize for the widest set of operations possible, you have rendered Vector less useful than it could have been in filling a specialized niche left vacant by the remainder of the collections library. It's still a nice data structure, just not as nice as it could be IMHO. And don't even get me started about the lack of + and update methods, but this isn't the right forum for that particular rant...
Seems to work quite well in practice for Clojure. I don't have any raw numbers from them, but Rich Hickey has more than hinted that Clojure's vectors perform astonishingly well in large, non-trivial applications with extensive random access.
When the base of the log term is 32, the result is effectively constant if only due to the limits of 32 bit integers. So, the upper-bound on the log(n) term is 6, which means that all of the performance is in the constant factor. A 2x improvement there is nothing to sniff at.
(eliding quoted analysis for brevity) It's interesting that you point out exactly these two things, because these were optimizations implemented quite recently. If you look at Clojure's PersistentVector class, you will notice that all of its implementation is done using heterogeneous arrays and non-inlined loops and recursion. This was the original implementation technique that I used when I ported the data structure, but I switched to the inlining upon the suggestion of Rich Hickey (he had been planning on implementing the same optimization, but hadn't gotten around to it yet). As a matter of interest, the inlining didn't have any appreciable impact on performance (at least not in my benchmarks). The typed arrays sort of fell out of the inlining, and since typed arrays are usually faster than untyped arrays (as you said), I decided to keep them.
Very interesting. I appreciate your analysis of HotSpot's inlining fu. It does explain some of the results I was seeing.
Quite so. As I mentioned, this is a fairly recent optimization that didn't produce much of an improvement. It wouldn't be too hard to use the un-inlined versions instead.
Well, this is disappointing, but not entirely surprising. Since you had access to my original port of Clojure's Vector before you started, I assumed there had to be some reason why you rolled your own, but I was never privy to the details. This certainly clears things up. While I still take serious issue with some of the API, the underlying data structure seems well justified. |
@odersky said: |
@djspiewak said:
It's purely a syntactic issue. With immutable Map, you can do something like the following: val m1 = Map('a -> 1, 'b -> 2)
val m2 = m1('a) = 3 Unfortunately, Vector uses "updated" rather than "update", standing out as the only collection to break with this convention. It also does not implement the + operator, which is particularly bizarre given the fact that + has been conventionally used in Scala for sequence append. If you prefer, I can start a thread on some mailing list regarding this issue so that we don't have to hijack a closed bug. |
@ijuma said: scala> val m2 = m1('a) = 3
warning: there were deprecation warnings; re-run with -deprecation for details
m2: scala.collection.immutable.Map[Symbol,Int] = Map(('a,3), ('b,2)) @deprecated("use `updated' instead")
def update[B1 >: B](key: A, value: B1): immutable.Map[A, B1] = updated(key, value).asInstanceOf[immutable.Map[A, B1]] Vector is following the convention for 2.8.0. You may disagree with the convention, of course, but Vector should not be singled out for using updated for immutable updates. |
Michel Salim (hircus) said:
I perused the history for PersistentVector.java and discovered that, as it happens, we're both right: http://github.com/richhickey/clojure/commits/master/src/jvm/clojure/lang/PersistentVector.java Persistent{HashMap,Vector} were relicensed as BSD on 2008-11-13 but, probably by accident, relicensed back to EPL on 2009-07-16 I'll verify on the clojure-dev mailing list to see what's going on. Probably academic for Scala ATM, but there are other Btree-based data structure implementations (e.g. python-blist) and it would be nice if they could share improvements. |
@djspiewak said:
Not with a recent build, which explains how I missed the deprecation.
I agree, not with the deprecation, but with the not singling out Vector. I think it's a terribly unfortunate API design decision, but I can live with it as long as it's not just Vector. |
@SethTisue said: |
The data structure used in collection.immutable.Vector is quite fast, but it's possible to do even better. I created a port of Clojure's PersistentVector?, reimplementing it from the ground up within the Scala 2.8 collections framework. The underlying data structure is a bitmapped dense trie, rather than the skip-list-like structure used by the Vector in the standard library. Because of the implementation as a dense trie, the locality of reference for Clojure's vector is substantially better than the Vector in the Scala standard library. It also lends itself nicely to a manual loop unrolling optimization which allows HotSpot? to inline almost all of the calls into O(6) array accesses.
I've done some extensive benchmarking, the results of which I've included below. It looks like my port of Clojure's PersistentVector? is around 50% faster than scala.collection.immutable.Vector on both random and sequential reads (index-based reads, not traversal). Performance is almost equivalent for writes. The port of Clojure's PersistentVector? uses almost 50% less memory on writes, and exactly the same memory for reads (i.e. none). The only operation which the stdlib Vector seems to do better than the port of Clojure's is the reverse operation. The raw results are included at the bottom of this ticket.
As I mentioned, this is a complete rewrite of Clojure's PersistentVector?. It retains the BSD license and source attribution just because I wanted to play it safe, but the source is now so far removed from the original that we can probably discard that (though I'd like to hear from Rich before we just delete the header).
It is worth noting that the stdlib Vector seems to support a fast prepend operation, whereas the dense trie data structure requires O(n) for this operation (append is still O(1)). In every other respect, the port of Clojure's PersistentVector? is either dramatically superior, or equivalent to the implementation in the stdlib. On top of it all, my port includes an update(Int, A) method, while this is glaringly absent from the stdlib API.
Is there any reason why we can't replace the stdlib Vector implementation with this one? APIs (such as updated, :+ and +:) could be added for backwards compatibility, but it seems to me that the performance characteristics of this port are so far beyond those of the stdlib that it really doesn't make any technical sense to stick with the current implementation.
Source and specs are attached.
The text was updated successfully, but these errors were encountered: