Scala Programming Language
  1. Scala Programming Language
  2. SI-3724

RFE: Replace immutable.Vector with a faster implementation (src included)

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Misc Library
    • Labels:
      None

      Description

      The data structure used in collection.immutable.Vector is quite fast, but it's possible to do even better. I created a port of Clojure's PersistentVector?, reimplementing it from the ground up within the Scala 2.8 collections framework. The underlying data structure is a bitmapped dense trie, rather than the skip-list-like structure used by the Vector in the standard library. Because of the implementation as a dense trie, the locality of reference for Clojure's vector is substantially better than the Vector in the Scala standard library. It also lends itself nicely to a manual loop unrolling optimization which allows HotSpot? to inline almost all of the calls into O(6) array accesses.

      I've done some extensive benchmarking, the results of which I've included below. It looks like my port of Clojure's PersistentVector? is around 50% faster than scala.collection.immutable.Vector on both random and sequential reads (index-based reads, not traversal). Performance is almost equivalent for writes. The port of Clojure's PersistentVector? uses almost 50% less memory on writes, and exactly the same memory for reads (i.e. none). The only operation which the stdlib Vector seems to do better than the port of Clojure's is the reverse operation. The raw results are included at the bottom of this ticket.

      As I mentioned, this is a complete rewrite of Clojure's PersistentVector?. It retains the BSD license and source attribution just because I wanted to play it safe, but the source is now so far removed from the original that we can probably discard that (though I'd like to hear from Rich before we just delete the header).

      It is worth noting that the stdlib Vector seems to support a fast prepend operation, whereas the dense trie data structure requires O for this operation (append is still O(1)). In every other respect, the port of Clojure's PersistentVector? is either dramatically superior, or equivalent to the implementation in the stdlib. On top of it all, my port includes an update(Int, A) method, while this is glaringly absent from the stdlib API.

      Is there any reason why we can't replace the stdlib Vector implementation with this one? APIs (such as updated, :+ and + could be added for backwards compatibility, but it seems to me that the performance characteristics of this port are so far beyond those of the stdlib that it really doesn't make any technical sense to stick with the current implementation.

      Source and specs are attached.

      
      Hardware: 2.66 Ghz Dual Core i7, 8 GB RAM, SATA2 SSD
      OS:       MacOS X 10.6.4
      JVM:      Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01-279, mixed mode)
      JVM Args: -Xmx2048m -Xms512m -server
      
      Setup:    All tests run 12 times with the highest and lowest results dropped.
                System.gc() was called after each test run.  Runtime#freeMemory()
                was used to measure memory usage and System.currentTimeMillis() was
                used to measure temporal performance.
      
      My port of Clojure's Vector is represented by 'Vector', while the 
      scala.collection.immutable.Vector class is represented by 'DefVector'.
      
      ================================================================================
        Fill 100000 Sequential Indexes
      --------------------------------------------------------------------------------
                  Vector Time: 11.174800ms
             ArrayBuffer Time: 1.823500ms
                Vector Memory: 23856.264648 KB
           ArrayBuffer Memory: 6266.987305 KB
      
              Time Difference: 9.351300ms
        Percentage Difference: 512.821497%
      
            Memory Difference: 17589.277344 KB
      --------------------------------------------------------------------------------
                  Vector Time: 11.174800ms
               DefVector Time: 11.018000ms
                Vector Memory: 23856.264648 KB
             DefVector Memory: 41946.000000 KB
      
              Time Difference: 0.156800ms
        Percentage Difference: 1.423126%
      
            Memory Difference: -18089.735352 KB
      --------------------------------------------------------------------------------
                  Vector Time: 11.174800ms
                  IntMap Time: 9.771800ms
                Vector Memory: 23856.264648 KB
                IntMap Memory: 39322.117188 KB
      
              Time Difference: 1.403000ms
        Percentage Difference: 14.357641%
      
            Memory Difference: -15465.852539 KB
      --------------------------------------------------------------------------------
                  Vector Time: 11.174800ms
             Map[Int, _] Time: 49.916800ms
                Vector Memory: 23856.264648 KB
           Map[Int, _] Memory: 104858.377930 KB
      
              Time Difference: -38.742000ms
        Percentage Difference: -77.613148%
      
            Memory Difference: -81002.113281 KB
      ================================================================================
      
      ================================================================================
        Read 100000 Sequential Indexes
      --------------------------------------------------------------------------------
                  Vector Time: 0.939000ms
             ArrayBuffer Time: 0.186600ms
                Vector Memory: 0.031250 KB
           ArrayBuffer Memory: 0.031250 KB
      
              Time Difference: 0.752400ms
        Percentage Difference: 403.215434%
      
            Memory Difference: 0.000000 KB
      --------------------------------------------------------------------------------
                  Vector Time: 0.939000ms
               DefVector Time: 1.932000ms
                Vector Memory: 0.031250 KB
             DefVector Memory: 0.031250 KB
      
              Time Difference: -0.993000ms
        Percentage Difference: -51.397516%
      
            Memory Difference: 0.000000 KB
      --------------------------------------------------------------------------------
                  Vector Time: 0.939000ms
                  IntMap Time: 7.413300ms
                Vector Memory: 0.031250 KB
                IntMap Memory: 0.031250 KB
      
              Time Difference: -6.474300ms
        Percentage Difference: -87.333576%
      
            Memory Difference: 0.000000 KB
      --------------------------------------------------------------------------------
                  Vector Time: 0.939000ms
             Map[Int, _] Time: 30.118900ms
                Vector Memory: 0.031250 KB
           Map[Int, _] Memory: 5242.539063 KB
      
              Time Difference: -29.179900ms
        Percentage Difference: -96.882356%
      
            Memory Difference: -5242.507813 KB
      ================================================================================
      
      ================================================================================
        Read 100000 Random Indexes
      --------------------------------------------------------------------------------
                  Vector Time: 10.785900ms
             ArrayBuffer Time: 0.226100ms
                Vector Memory: 0.031250 KB
           ArrayBuffer Memory: 0.031250 KB
      
              Time Difference: 10.559800ms
        Percentage Difference: 4670.411322%
      
            Memory Difference: 0.000000 KB
      --------------------------------------------------------------------------------
                  Vector Time: 10.785900ms
               DefVector Time: 18.195500ms
                Vector Memory: 0.031250 KB
             DefVector Memory: 0.031250 KB
      
              Time Difference: -7.409600ms
        Percentage Difference: -40.722157%
      
            Memory Difference: 0.000000 KB
      --------------------------------------------------------------------------------
                  Vector Time: 10.785900ms
                  IntMap Time: 39.081400ms
                Vector Memory: 0.031250 KB
                IntMap Memory: 0.031250 KB
      
              Time Difference: -28.295500ms
        Percentage Difference: -72.401449%
      
            Memory Difference: 0.000000 KB
      --------------------------------------------------------------------------------
                  Vector Time: 10.785900ms
             Map[Int, _] Time: 39.914500ms
                Vector Memory: 0.031250 KB
           Map[Int, _] Memory: 7832.765625 KB
      
              Time Difference: -29.128600ms
        Percentage Difference: -72.977489%
      
            Memory Difference: -7832.734375 KB
      ================================================================================
      
      ================================================================================
        Reverse of Length 100000
      --------------------------------------------------------------------------------
                  Vector Time: 16.455500ms
             ArrayBuffer Time: 1.142500ms
                Vector Memory: 23457.923828 KB
           ArrayBuffer Memory: 7832.945313 KB
      
              Time Difference: 15.313000ms
        Percentage Difference: 1340.306346%
      
            Memory Difference: 15624.978516 KB
      --------------------------------------------------------------------------------
                  Vector Time: 16.455500ms
               DefVector Time: 4.231400ms
                Vector Memory: 23457.923828 KB
             DefVector Memory: 7833.148438 KB
      
              Time Difference: 12.224100ms
        Percentage Difference: 288.890202%
      
            Memory Difference: 15624.775391 KB
      ================================================================================
      
      ================================================================================
        Compute Length (100000)
      --------------------------------------------------------------------------------
                  Vector Time: 0.006600ms
             ArrayBuffer Time: 0.017300ms
                Vector Memory: 0.031250 KB
           ArrayBuffer Memory: 0.031250 KB
      
              Time Difference: -0.010700ms
        Percentage Difference: -61.849711%
      
            Memory Difference: 0.000000 KB
      --------------------------------------------------------------------------------
                  Vector Time: 0.006600ms
               DefVector Time: 0.016400ms
                Vector Memory: 0.031250 KB
             DefVector Memory: 0.031250 KB
      
              Time Difference: -0.009800ms
        Percentage Difference: -59.756098%
      
            Memory Difference: 0.000000 KB
      --------------------------------------------------------------------------------
                  Vector Time: 0.006600ms
                  IntMap Time: 2.059400ms
                Vector Memory: 0.031250 KB
                IntMap Memory: 0.031250 KB
      
              Time Difference: -2.052800ms
        Percentage Difference: -99.679518%
      
            Memory Difference: 0.000000 KB
      --------------------------------------------------------------------------------
                  Vector Time: 0.006600ms
             Map[Int, _] Time: 0.013500ms
                Vector Memory: 0.031250 KB
           Map[Int, _] Memory: 0.031250 KB
      
              Time Difference: -0.006900ms
        Percentage Difference: -51.111111%
      
            Memory Difference: 0.000000 KB
      ================================================================================
      
      1. Vector.scala
        31 kB
        Jira Admin
      2. VectorSpecs.scala
        10 kB
        Jira Admin

        Activity

        Hide
        Daniel Spiewak added a comment -

        Replying to [comment:20 odersky]:
        > I don't understand: Vectors do have : (prepend), : (append) and updated methods.
        > Which functionality are you missing?

        It's purely a syntactic issue. With immutable Map, you can do something like the following:

        val m1 = Map('a -> 1, 'b -> 2)
        val m2 = m1('a) = 3
        

        Unfortunately, Vector uses "updated" rather than "update", standing out as the only collection to break with this convention. It also does not implement the + operator, which is particularly bizarre given the fact that + has been conventionally used in Scala for sequence append.

        If you prefer, I can start a thread on some mailing list regarding this issue so that we don't have to hijack a closed bug.

        Show
        Daniel Spiewak added a comment - Replying to [comment:20 odersky] : > I don't understand: Vectors do have : (prepend), : (append) and updated methods. > Which functionality are you missing? It's purely a syntactic issue. With immutable Map, you can do something like the following: val m1 = Map('a -> 1, 'b -> 2) val m2 = m1('a) = 3 Unfortunately, Vector uses "updated" rather than "update", standing out as the only collection to break with this convention. It also does not implement the + operator, which is particularly bizarre given the fact that + has been conventionally used in Scala for sequence append. If you prefer, I can start a thread on some mailing list regarding this issue so that we don't have to hijack a closed bug.
        Hide
        Ismael Juma added a comment -

        Daniel, have you actually tried the code you pasted with Scala 2.8.0?

        scala> val m2 = m1('a) = 3
        warning: there were deprecation warnings; re-run with -deprecation for details
        m2: scala.collection.immutable.Map[Symbol,Int] = Map(('a,3), ('b,2))
        
        @deprecated("use `updated' instead")
        def update[B1 >: B](key: A, value: B1): immutable.Map[A, B1] = updated(key, value).asInstanceOf[immutable.Map[A, B1]]
        

        Vector is following the convention for 2.8.0. You may disagree with the convention, of course, but Vector should not be singled out for using updated for immutable updates.

        Show
        Ismael Juma added a comment - Daniel, have you actually tried the code you pasted with Scala 2.8.0? scala> val m2 = m1('a) = 3 warning: there were deprecation warnings; re-run with -deprecation for details m2: scala.collection.immutable.Map[Symbol,Int] = Map(('a,3), ('b,2)) @deprecated("use `updated' instead") def update[B1 >: B](key: A, value: B1): immutable.Map[A, B1] = updated(key, value).asInstanceOf[immutable.Map[A, B1]] Vector is following the convention for 2.8.0. You may disagree with the convention, of course, but Vector should not be singled out for using updated for immutable updates.
        Hide
        Michel Salim added a comment -

        Replying to [comment:12 djspiewak]:
        > Replying to [comment:11 hircus]:
        > > The Clojure source is EPL-licensed – do you have explicit permission from Rich Hickey to have the code under the BSD license?
        >
        > Clojure may be EPL licensed, but the data structures are under BSD. I have explicit confirmation of that via email, but you can also see it if you look in the headers of the source for PersistentVector et al.

        I perused the history for PersistentVector.java and discovered that, as it happens, we're both right:

        http://github.com/richhickey/clojure/commits/master/src/jvm/clojure/lang/PersistentVector.java

        Persistent

        {HashMap,Vector}

        were relicensed as BSD on 2008-11-13
        http://github.com/richhickey/clojure/commit/2fb7a11de0332080286ae74ae6eccf8bf2749621

        but, probably by accident, relicensed back to EPL on 2009-07-16
        http://github.com/richhickey/clojure/commit/f2de9c79eb5978a6a81ad3d0994ec44f490e3152

        I'll verify on the clojure-dev mailing list to see what's going on. Probably academic for Scala ATM, but there are other Btree-based data structure implementations (e.g. python-blist) and it would be nice if they could share improvements.

        Show
        Michel Salim added a comment - Replying to [comment:12 djspiewak] : > Replying to [comment:11 hircus] : > > The Clojure source is EPL-licensed – do you have explicit permission from Rich Hickey to have the code under the BSD license? > > Clojure may be EPL licensed, but the data structures are under BSD. I have explicit confirmation of that via email, but you can also see it if you look in the headers of the source for PersistentVector et al. I perused the history for PersistentVector.java and discovered that, as it happens, we're both right: http://github.com/richhickey/clojure/commits/master/src/jvm/clojure/lang/PersistentVector.java Persistent {HashMap,Vector} were relicensed as BSD on 2008-11-13 http://github.com/richhickey/clojure/commit/2fb7a11de0332080286ae74ae6eccf8bf2749621 but, probably by accident, relicensed back to EPL on 2009-07-16 http://github.com/richhickey/clojure/commit/f2de9c79eb5978a6a81ad3d0994ec44f490e3152 I'll verify on the clojure-dev mailing list to see what's going on. Probably academic for Scala ATM, but there are other Btree-based data structure implementations (e.g. python-blist) and it would be nice if they could share improvements.
        Hide
        Daniel Spiewak added a comment -

        Replying to [comment:22 ijuma]:
        > Daniel, have you actually tried the code you pasted with Scala 2.8.0?

        Not with a recent build, which explains how I missed the deprecation.

        > Vector is following the convention for 2.8.0. You may disagree with the convention, of course, but Vector should not be singled out for using updated for immutable updates.

        I agree, not with the deprecation, but with the not singling out Vector. I think it's a terribly unfortunate API design decision, but I can live with it as long as it's not just Vector.

        Show
        Daniel Spiewak added a comment - Replying to [comment:22 ijuma] : > Daniel, have you actually tried the code you pasted with Scala 2.8.0? Not with a recent build, which explains how I missed the deprecation. > Vector is following the convention for 2.8.0. You may disagree with the convention, of course, but Vector should not be singled out for using updated for immutable updates. I agree, not with the deprecation, but with the not singling out Vector. I think it's a terribly unfortunate API design decision, but I can live with it as long as it's not just Vector.
        Hide
        Seth Tisue added a comment -

        I made a separate ticket on fast concat (Vector ++ Vector) (SI-4442)

        Show
        Seth Tisue added a comment - I made a separate ticket on fast concat (Vector ++ Vector) ( SI-4442 )

          People

          • Assignee:
            Tiark Rompf
            Reporter:
            Daniel Spiewak
            TracCC:
            Antony Stubbs, Ben Lings, Christos KK Loverdos, Daniel Sobral, Erik Engbrecht, Ismael Juma, Johannes Rudolph, Jorge Ortiz, Michel Salim, Rafael de F. Ferreira, Seth Tisue
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development