Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: Backlog
    • Component/s: Compiler (Misc)
    • Labels:
      None
    • Environment:

      compiler, optimize, for loop, while

      Description

      I would like to suggest the compiler optimize the common case of for loops, that is,

      for (var <- Range [by step])
      for (var <- int to int [by step])
      for (var <- int until int [by step])
      

      to use while loops under the covers instead of. Currently, nested for loops using ranges/iterators are sometimes an order of magnitude slower than while loops. However, while loop constructs for iterating over arrays are very cumbersome, and the functional style (foreach) is also cumbersome and introduces a lot of function call overhead as well.

      The following two matrix multipication implementations

        def matMulUsingIterators (
             a : Array[Array[Double]],
             b : Array[Array[Double]],
             c : Array[Array[Double]]) : Unit = {
       
          val b_j = new Array[Double](b.length)
       
          for (j <- 0 until b(0).length) {
              for (k <- 0 until b.length) {
                  b_j(k) = b(k)(j)
              }
              for (i <- 0 until a.length) {
                  val c_i = c(i)
                  val a_i = a(i)
                  var s = 0.0d;
                  for (k <- 0 until b.length) {
                      s += a_i(k) * b_j(k)
                  }
                  c_i(j) = s
              }
          }
        }
       
        def matMulUsingRanges (
             a : Array[Array[Double]],
             b : Array[Array[Double]],
             c : Array[Array[Double]]) : Unit = {
       
          val jRange = 0 until b(0).length;
          val kRange = 0 until b.length;
          val iRange = 0 until a.length;
       
          val b_j = new Array[Double](b.length)
       
          for (j <- jRange) {
              for (k <- kRange) {
                  b_j(k) = b(k)(j)
              }
              for (i <- iRange) {
                  val c_i = c(i);
                  val a_i = a(i);
                  var s = 0.0d;
                  for (k <- kRange) {
                      s += a_i(k) * b_j(k)
                  }
                  c_i(j) = s
              }
          }
        }
      

      are much slower than the same algorithm coded with while loops:

        def matMulUsingWhileLoop (
            a : Array[Array[Double]],
            b : Array[Array[Double]],
            c : Array[Array[Double]]) : Unit = {
       
          val m = a.length;
          val p = b(0).length;
          val n = b.length;
       
          val b_j = new Array[Double](b.length);
       
          var i = 0; var j = 0; var k = 0;
          while (j < p) {
              k = 0
              while (k < n) {
                  b_j(k) = b(k)(j);
                  k += 1
              }
              i = 0
              while (i < m) {
                  val c_i = c(i);
                  val a_i = a(i);
                  var s = 0.0d;
                  k = 0;
                  while (k < n) {
                      s += a_i(k) * b_j(k);
                      k += 1
                  }
                  c_i(j) = s;
                  i += 1
              }
              j += 1;
          }
        }
      

      but the while loop code is more complex and error prone.

      (Sorry, Trac appears to remove some line breaks; I
      added some explicit semis but might have missed some;
      I'll try attaching actual working source code)

      Running this while measuring time in nanoseconds:

      Iterators   2,807,815,301ns
      Ranges      2,789,958,191ns
      While Loop  190,778,574ns
      

      MatMul by Iterators is 14 times as slow as with while loops.

      It does not appear that the Hotspot runtime profiling and optimization dramatically helps this performance problem
      This performance problem can hurt adoption of Scala for many types of uses/applications.

        Attachments

          Activity

          djb David Biesack created issue -
          Hide
          djb David Biesack added a comment -

          // Scala code to compare performance of nested int loops
          object MatMul {
           
           
            def matMulUsingIterators (
                 a : Array[Array[Double]],
                 b : Array[Array[Double]],
                 c : Array[Array[Double]]) : Unit = {
           
              val b_j = new Array[Double](b.length)
           
              for (j <- 0 until b(0).length) {
                  for (k <- 0 until b.length) {
                      b_j(k) = b(k)(j)
                  }
                  for (i <- 0 until a.length) {
                      val c_i = c(i)
                      val a_i = a(i)
                      var s = 0.0d;
                      for (k <- 0 until b.length) {
                          s += a_i(k) * b_j(k)
                      }
                      c_i(j) = s
                  }
              }
            }
           
            def matMulUsingRanges (
                 a : Array[Array[Double]],
                 b : Array[Array[Double]],
                 c : Array[Array[Double]]) : Unit = {
           
              val jRange = 0 until b(0).length // p
              val kRange = 0 until b.length // n
              val iRange = 0 until a.length // m
           
              val b_j = new Array[Double](b.length)
           
              for (j <- jRange) {
                  for (k <- kRange) {
                      b_j(k) = b(k)(j)
                  }
                  for (i <- iRange) {
                      val c_i = c(i)
                      val a_i = a(i)
                      var s = 0.0d;
                      for (k <- kRange) {
                          s += a_i(k) * b_j(k)
                      }
                      c_i(j) = s
                  }
              }
            }
           
            def matMulUsingLimits (
                 a : Array[Array[Double]],
                 b : Array[Array[Double]],
                 c : Array[Array[Double]]) : Unit = {
           
              val b_j = new Array[Double](b.length)
           
              val m = a.length
              val p = b(0).length
              val n = b.length
           
              for (j <- 0 until p) {
                  for (k <- 0 until n) {
                      b_j(k) = b(k)(j)
                  }
                  for (i <- 0 until m) {
                      val c_i = c(i)
                      val a_i = a(i)
                      var s = 0.0d;
                      for (k <- 0 until n) {
                          s += a_i(k) * b_j(k)
                      }
                      c_i(j) = s
                  }
              }
            }
           
            def matMulUsingWhileLoop (
                a : Array[Array[Double]],
                b : Array[Array[Double]],
                c : Array[Array[Double]]) : Unit = {
           
              val m = a.length
              val p = b(0).length
              val n = b.length
           
              val b_j = new Array[Double](b.length)
           
              var i = 0; var j = 0; var k = 0
              while (j < p) {
                  k = 0
                  while (k < n) {
                      b_j(k) = b(k)(j)
                      k += 1
                  }
                  i = 0
                  while (i < m) {
                      val c_i = c(i)
                      val a_i = a(i)
                      var s = 0.0d;
                      k = 0
                      while (k < n) {
                          s += a_i(k) * b_j(k)
                          k += 1
                      }
                      c_i(j) = s
                      i += 1
                  }
                  j += 1
              }
            }
           
            def time[R](block: => R) : (Long, R) = {
              val start = System.nanoTime()
              val result : R = block
              val time = System.nanoTime() - start
              (time, result)
            }
           
            val format = new java.text.DecimalFormat("0,000'ns'");
            def report[R](label: String, result: (Long, R)) = {
               println(label + " " + format.format(result._1))
               }
           
              private val FACTOR = 5
              private val M = 80
              private val N = 70
              private val P = 60
           
              def main(args : Array[String]) = {
                for (trial <- 3 until 0 by -1) {
                val factor = (if (System.getProperty("factor") != null)
                                Integer.parseInt(System.getProperty("factor"))
                              else
                                  FACTOR)
                val multiplier = if (trial == 1) factor else 1;
                val m = M * multiplier
                val n = N * multiplier
                val p = P * multiplier
                val a = new Array[Array[Double]](m,n)
                val b = new Array[Array[Double]](n,p)
                val c = new Array[Array[Double]](m,p)
                println("\nMultiply c[" + m + "," + p + "] = a[" + m + "," + n + "] times b[" + n + "," + p + "]\n");
                val whileTime = time(matMulUsingWhileLoop(a,b,c))
                val iterTime = time(matMulUsingIterators(a,b,c))
                report("Iterators  ", iterTime)
                report("Limits     ", time(matMulUsingLimits(a,b,c)))
                report("Ranges     ", time(matMulUsingRanges(a,b,c)))
                report("While Loop ", whileTime)
                println("MatMul by Iterators is " + iterTime._1 / whileTime._1 + " times as slow as with while loops.")
                }
            }
          }
           

          Show
          djb David Biesack added a comment - // Scala code to compare performance of nested int loops object MatMul {     def matMulUsingIterators ( a : Array[Array[Double]], b : Array[Array[Double]], c : Array[Array[Double]]) : Unit = {   val b_j = new Array[Double](b.length)   for (j <- 0 until b(0).length) { for (k <- 0 until b.length) { b_j(k) = b(k)(j) } for (i <- 0 until a.length) { val c_i = c(i) val a_i = a(i) var s = 0.0d; for (k <- 0 until b.length) { s += a_i(k) * b_j(k) } c_i(j) = s } } }   def matMulUsingRanges ( a : Array[Array[Double]], b : Array[Array[Double]], c : Array[Array[Double]]) : Unit = {   val jRange = 0 until b(0).length // p val kRange = 0 until b.length // n val iRange = 0 until a.length // m   val b_j = new Array[Double](b.length)   for (j <- jRange) { for (k <- kRange) { b_j(k) = b(k)(j) } for (i <- iRange) { val c_i = c(i) val a_i = a(i) var s = 0.0d; for (k <- kRange) { s += a_i(k) * b_j(k) } c_i(j) = s } } }   def matMulUsingLimits ( a : Array[Array[Double]], b : Array[Array[Double]], c : Array[Array[Double]]) : Unit = {   val b_j = new Array[Double](b.length)   val m = a.length val p = b(0).length val n = b.length   for (j <- 0 until p) { for (k <- 0 until n) { b_j(k) = b(k)(j) } for (i <- 0 until m) { val c_i = c(i) val a_i = a(i) var s = 0.0d; for (k <- 0 until n) { s += a_i(k) * b_j(k) } c_i(j) = s } } }   def matMulUsingWhileLoop ( a : Array[Array[Double]], b : Array[Array[Double]], c : Array[Array[Double]]) : Unit = {   val m = a.length val p = b(0).length val n = b.length   val b_j = new Array[Double](b.length)   var i = 0; var j = 0; var k = 0 while (j < p) { k = 0 while (k < n) { b_j(k) = b(k)(j) k += 1 } i = 0 while (i < m) { val c_i = c(i) val a_i = a(i) var s = 0.0d; k = 0 while (k < n) { s += a_i(k) * b_j(k) k += 1 } c_i(j) = s i += 1 } j += 1 } }   def time[R](block: => R) : (Long, R) = { val start = System.nanoTime() val result : R = block val time = System.nanoTime() - start (time, result) }   val format = new java.text.DecimalFormat("0,000'ns'"); def report[R](label: String, result: (Long, R)) = { println(label + " " + format.format(result._1)) }   private val FACTOR = 5 private val M = 80 private val N = 70 private val P = 60   def main(args : Array[String]) = { for (trial <- 3 until 0 by -1) { val factor = (if (System.getProperty("factor") != null) Integer.parseInt(System.getProperty("factor")) else FACTOR) val multiplier = if (trial == 1) factor else 1; val m = M * multiplier val n = N * multiplier val p = P * multiplier val a = new Array[Array[Double]](m,n) val b = new Array[Array[Double]](n,p) val c = new Array[Array[Double]](m,p) println("\nMultiply c[" + m + "," + p + "] = a[" + m + "," + n + "] times b[" + n + "," + p + "]\n"); val whileTime = time(matMulUsingWhileLoop(a,b,c)) val iterTime = time(matMulUsingIterators(a,b,c)) report("Iterators ", iterTime) report("Limits ", time(matMulUsingLimits(a,b,c))) report("Ranges ", time(matMulUsingRanges(a,b,c))) report("While Loop ", whileTime) println("MatMul by Iterators is " + iterTime._1 / whileTime._1 + " times as slow as with while loops.") } } }  
          Hide
          jshore Jonathan Shore added a comment -

          This is a very important performance enhancement for numerical work.

          Seems that this could be solved with some pattern matching in the compiler, recognizing a range with no filtering (in the simplest case). Could then remap to a while form for the next stage.

          Show
          jshore Jonathan Shore added a comment - This is a very important performance enhancement for numerical work. Seems that this could be solved with some pattern matching in the compiler, recognizing a range with no filtering (in the simplest case). Could then remap to a while form for the next stage.
          Hide
          sciss Sciss added a comment -

          +1

          Show
          sciss Sciss added a comment - +1
          Hide
          kbarros Kipton Barros added a comment -

          +1

          Show
          kbarros Kipton Barros added a comment - +1
          Hide
          villane Erkki Lindpere added a comment -

          +1

          It would be nice if for-comprehensions with simple filters could be optimized as well, turning
          <pre>for (i <- 1 to 10 if shouldProcess) {}</pre>
          into
          <pre>var i = 1
          while (i < 10) {
          if (shouldProcess) {
          }
          i += 1
          }

          And extra nice if this would work with random access sequences.

          Show
          villane Erkki Lindpere added a comment - +1 It would be nice if for-comprehensions with simple filters could be optimized as well, turning <pre>for (i <- 1 to 10 if shouldProcess ) {}</pre> into <pre>var i = 1 while (i < 10) { if (shouldProcess ) { } i += 1 } And extra nice if this would work with random access sequences.
          Hide
          phdp Philippe added a comment -

          +1

          This is actually the only thing keeping me from using Scala.

          Show
          phdp Philippe added a comment - +1 This is actually the only thing keeping me from using Scala.
          Hide
          dragos Iulian Dragos added a comment -

          Replying to [comment:12 PhDP]:
          > +1
          >
          > This is actually the only thing keeping me from using Scala.
          Have you tried '-optimize'? It can help a lot. It's very unlikely this will move from library to compiler-generated loops.

          Show
          dragos Iulian Dragos added a comment - Replying to [comment:12 PhDP] : > +1 > > This is actually the only thing keeping me from using Scala. Have you tried '-optimize'? It can help a lot. It's very unlikely this will move from library to compiler-generated loops.
          Hide
          akiezun Adam Kiezun added a comment -

          +1

          Show
          akiezun Adam Kiezun added a comment - +1
          Hide
          mgarcia Miguel Garcia added a comment -

          I haven't benchmarked (with and without -optimize) to see whether the current compilation scheme for "simple loops" is good enough.

          But in case it isn't, looks like the single place to change in the compiler is method TreeBuilder.makeFor().

          According to its comment,

          http://lampsvn.epfl.ch/trac/scala/browser/scala/trunk/src/compiler/scala/tools/nsc/ast/parser/TreeBuilder.scala

          it performs five transformations. Prepending as special case a transformation for "simple loops" would not change semantics. (Well, assuming that a local definition does not shadow the usual ones: "to" in "1 to 10", "Range", and so on)

          Miguel
          http://www.sts.tu-harburg.de/people/mi.garcia/

          Show
          mgarcia Miguel Garcia added a comment - I haven't benchmarked (with and without -optimize) to see whether the current compilation scheme for "simple loops" is good enough. But in case it isn't, looks like the single place to change in the compiler is method TreeBuilder.makeFor(). According to its comment, http://lampsvn.epfl.ch/trac/scala/browser/scala/trunk/src/compiler/scala/tools/nsc/ast/parser/TreeBuilder.scala it performs five transformations. Prepending as special case a transformation for "simple loops" would not change semantics. (Well, assuming that a local definition does not shadow the usual ones: "to" in "1 to 10", "Range", and so on) Miguel http://www.sts.tu-harburg.de/people/mi.garcia/
          Hide
          mellit Anton Mellit added a comment -

          I tried to create a script with the following:

          def timeit(f : () => Unit) {
              val t1 = System.currentTimeMillis()
              f()
              val t2 = System.currentTimeMillis()
           
              println(t2-t1)
          }    
           
          def repeat(n : Int, f : Int => Unit) : Unit = {
              var i = 0
              while (i<n) {
                  f(i)
                  i += 1
              }
          }
           
          def test0() {
              var i = 0
              var sum = 0
              while (i < 1000000000) {
                  sum += i
                  i += 1
              }
              println(sum)
          }
           
          def test1() {
              var sum = 0
              repeat(1000000000, i => {
                  sum += i
              })
              println(sum)
          }
           
          def test2() {
              var sum = 0
              for(i <- 0 until 1000000000) {
                  sum += i
              }
              println(sum)
          }
           
          timeit(test0)
          timeit(test1)
          timeit(test2)

          Result is:

          -1243309312
          467
          -1243309312
          504
          -1243309312
          11899

          May be this 'repeat' is a workaround? Warning: works only with 'scala -optimise'. This is not very stable, sometimes some seemingly minor modifications, i.e. moving the code outside of the function, break it and I get 12000 for 'repeat'.

          Show
          mellit Anton Mellit added a comment - I tried to create a script with the following: def timeit(f : () => Unit) { val t1 = System.currentTimeMillis() f() val t2 = System.currentTimeMillis()   println(t2-t1) }   def repeat(n : Int, f : Int => Unit) : Unit = { var i = 0 while (i<n) { f(i) i += 1 } }   def test0() { var i = 0 var sum = 0 while (i < 1000000000) { sum += i i += 1 } println(sum) }   def test1() { var sum = 0 repeat(1000000000, i => { sum += i }) println(sum) }   def test2() { var sum = 0 for(i <- 0 until 1000000000) { sum += i } println(sum) }   timeit(test0) timeit(test1) timeit(test2) Result is: -1243309312 467 -1243309312 504 -1243309312 11899 May be this 'repeat' is a workaround? Warning: works only with 'scala -optimise'. This is not very stable, sometimes some seemingly minor modifications, i.e. moving the code outside of the function, break it and I get 12000 for 'repeat'.
          Hide
          djb David Biesack added a comment -

          Replying to [comment:28 mellit]:
          > I tried to create a script with the following:
          >
          >

           


          > ...
          > def repeat(n : Int, f : Int => Unit) : Unit = {
          > var i = 0
          > while (i<n)

          { > f(i) > i += 1 > }

          > }
          > }}

          Thanks for the suggestion. I hesitate to try or rely on this for several
          reasons:

          1) having to remember to not use the standard for loop is problematic. The syntax is different and not based on generators and less general, although one could certainly patch it to be more general; i.e pass in start and end and optional increment values: repeat(start:Int, end:Int, increment:Int) or repeat(range:Range) etc. (and also handle backwards iteration when called for).

          2) This may still involve at least an extra function call per iteration. If the compiler has to inject other synthetic calls into the body to allow access to other lexically scoped variables, this may also affect performance. The goal of this optimization is to achieve Java-level performance of for loops where possible.

          3) most critically, if/when the Scala compiler implements this ticket's optimization, then all code using this repeat control would not get the optimization, requiring code maintenance to undo it.

          Show
          djb David Biesack added a comment - Replying to [comment:28 mellit] : > I tried to create a script with the following: > >   > ... > def repeat(n : Int, f : Int => Unit) : Unit = { > var i = 0 > while (i<n) { > f(i) > i += 1 > } > } > }} Thanks for the suggestion. I hesitate to try or rely on this for several reasons: 1) having to remember to not use the standard for loop is problematic. The syntax is different and not based on generators and less general, although one could certainly patch it to be more general; i.e pass in start and end and optional increment values: repeat(start:Int, end:Int, increment:Int) or repeat(range:Range) etc. (and also handle backwards iteration when called for). 2) This may still involve at least an extra function call per iteration. If the compiler has to inject other synthetic calls into the body to allow access to other lexically scoped variables, this may also affect performance. The goal of this optimization is to achieve Java-level performance of for loops where possible. 3) most critically, if/when the Scala compiler implements this ticket's optimization, then all code using this repeat control would not get the optimization, requiring code maintenance to undo it.
          Hide
          extempore Paul Phillips added a comment -

          Please see update on this ticket sent to scala-user, also available here: http://scala-programming-language.1934581.n4.nabble.com/optimizing-simple-fors-td2545502.html#a2545502

          I very much agree with mgarcia's comment of nine months ago that TreeBuilder.makeFor already does a whole pile of tree transformations and there is no convincing reason we shouldn't add one which has this kind of impact. Failing agreement on that point, I believe we have a pressing responsibility to clean up the parsing phase and plugin architecture sufficiently that it would be possible to do this transformation with a compiler plugin.

          Show
          extempore Paul Phillips added a comment - Please see update on this ticket sent to scala-user, also available here: http://scala-programming-language.1934581.n4.nabble.com/optimizing-simple-fors-td2545502.html#a2545502 I very much agree with mgarcia's comment of nine months ago that TreeBuilder.makeFor already does a whole pile of tree transformations and there is no convincing reason we shouldn't add one which has this kind of impact. Failing agreement on that point, I believe we have a pressing responsibility to clean up the parsing phase and plugin architecture sufficiently that it would be possible to do this transformation with a compiler plugin.
          Hide
          olivier.chafik Olivier Chafik added a comment -

          Hello,

          I've just written such a compiler plugin, which you can install straight away on 2.8.0 with sbaz for testing :
          http://article.gmane.org/gmane.comp.lang.scala.user/31814

          It's probably full of bugs, but it rewrites int-range for loops with no filters and Array[T].foreach, Array[T].map into equivalent while loops.

          You can have a look at the auto tests to see the supported cases :
          http://code.google.com/p/nativelibs4java/source/browse/branches/OpenCL-BridJ/libraries/OpenCL/ScalaCLPlugin/src/test/scala/scalacl/

          Looking forward to seeing something like this mainstream
          Cheers

          zOlive

          Show
          olivier.chafik Olivier Chafik added a comment - Hello, I've just written such a compiler plugin, which you can install straight away on 2.8.0 with sbaz for testing : http://article.gmane.org/gmane.comp.lang.scala.user/31814 It's probably full of bugs, but it rewrites int-range for loops with no filters and Array [T] .foreach, Array [T] .map into equivalent while loops. You can have a look at the auto tests to see the supported cases : http://code.google.com/p/nativelibs4java/source/browse/branches/OpenCL-BridJ/libraries/OpenCL/ScalaCLPlugin/src/test/scala/scalacl/ Looking forward to seeing something like this mainstream Cheers – zOlive
          Hide
          olivier.chafik Olivier Chafik added a comment -

          I've adapted the fannkuch and nbody benchmarks that were in the scala-user thread mentioned previously and I had to adapt it a bit (inlining the ranges that were stored as val range = x until y).

          Here's the modified code :
          http://nativelibs4java.sourceforge.net/sbaz/scalacl/examples/

          And to run it (with ScalaCL plugin installed via sbaz: sbaz install scalacl-compiler-plugin) :
          DISABLE_SCALACL_PLUGIN=1 scalac fannkuch.scala && scala fannkuch
          scalac fannkuch.scala && scala fannkuch

          DISABLE_SCALACL_PLUGIN=1 scalac nbody.scala && scala nbody
          scalac nbody.scala && scala nbody

          With the plugin turned on, the performance of the three variants (While, Limit, Range) is the same (the first while is actually slower, I haven't investigated why).

          Show
          olivier.chafik Olivier Chafik added a comment - I've adapted the fannkuch and nbody benchmarks that were in the scala-user thread mentioned previously and I had to adapt it a bit (inlining the ranges that were stored as val range = x until y). Here's the modified code : http://nativelibs4java.sourceforge.net/sbaz/scalacl/examples/ And to run it (with ScalaCL plugin installed via sbaz: sbaz install scalacl-compiler-plugin) : DISABLE_SCALACL_PLUGIN=1 scalac fannkuch.scala && scala fannkuch scalac fannkuch.scala && scala fannkuch DISABLE_SCALACL_PLUGIN=1 scalac nbody.scala && scala nbody scalac nbody.scala && scala nbody With the plugin turned on, the performance of the three variants (While, Limit, Range) is the same (the first while is actually slower, I haven't investigated why).
          Hide
          olivier.chafik Olivier Chafik added a comment -

          (Sorry for spamming you again, this should be the last time)

          I've just enhanced the plugin with more conversions to while loops :

          • Array.foldLeft / foldRight
          • Array.reduceLeft / reduceRight
          • Array.scanLeft / scanRight
            (in addition to Array.foreach, Array.map and the int range foreach with no filter)

          Also, the conversions should now work on method references and inline lambdas the same way.

          Further progress and plans can be tracked at the bottom of this page :
          http://code.google.com/p/scalacl/wiki/ScalaCLPlugin

          Show
          olivier.chafik Olivier Chafik added a comment - (Sorry for spamming you again, this should be the last time) I've just enhanced the plugin with more conversions to while loops : Array.foldLeft / foldRight Array.reduceLeft / reduceRight Array.scanLeft / scanRight (in addition to Array.foreach, Array.map and the int range foreach with no filter) Also, the conversions should now work on method references and inline lambdas the same way. Further progress and plans can be tracked at the bottom of this page : http://code.google.com/p/scalacl/wiki/ScalaCLPlugin
          rytz Lukas Rytz made changes -
          Field Original Value New Value
          Workflow jira [ 20547 ] scala [ 24423 ]
          ymasory Yuvi Masory made changes -
          Workflow scala [ 24423 ] Open-Closed Workflow v3 [ 29883 ]
          rytz Lukas Rytz made changes -
          Assignee Iulian Dragos [ dragos ]
          Hide
          herrkhosse Herrmann Khosse added a comment -

          +1

          Show
          herrkhosse Herrmann Khosse added a comment - +1
          Hide
          sethtisue Seth Tisue added a comment -

          "Spire also provides a loop macro called cfor whose syntax bears a slight resemblance to a traditional for-loop from C or Java. This macro expands to a tail-recursive function, which will inline literal function arguments." https://github.com/non/spire

          Show
          sethtisue Seth Tisue added a comment - "Spire also provides a loop macro called cfor whose syntax bears a slight resemblance to a traditional for-loop from C or Java. This macro expands to a tail-recursive function, which will inline literal function arguments." https://github.com/non/spire
          Hide
          ochafik Olivier Chafik added a comment -

          Here's another loop macro with an arguably better syntax: Scalaxy/Loops (which reuses code from ScalaCL):

          https://github.com/ochafik/Scalaxy/tree/master/Loops

          Show
          ochafik Olivier Chafik added a comment - Here's another loop macro with an arguably better syntax: Scalaxy/Loops (which reuses code from ScalaCL): https://github.com/ochafik/Scalaxy/tree/master/Loops
          antoras Simon Schäfer made changes -
          Description I would like to suggest the compiler optimize the common case of for loops, that is,

          for (var <- Range [by step])
          for (var <- int to int [by step])
          for (var <- int until int [by step])

          to use while loops under the covers instead of. Currently, nested for loops using ranges/iterators are sometimes an order of magnitude slower than while loops. However, while loop constructs for iterating over arrays are very cumbersome, and the functional style (foreach) is also cumbersome and introduces a lot of function call overhead as well.

          The following two matrix multipication implementations


            def matMulUsingIterators (
                 a : Array[Array[Double]],
                 b : Array[Array[Double]],
                 c : Array[Array[Double]]) : Unit = {

              val b_j = new Array[Double](b.length)

              for (j <- 0 until b(0).length) {
                  for (k <- 0 until b.length) {
                      b_j(k) = b(k)(j)
                  }
                  for (i <- 0 until a.length) {
                      val c_i = c(i)
                      val a_i = a(i)
                      var s = 0.0d;
                      for (k <- 0 until b.length) {
                          s += a_i(k) * b_j(k)
                      }
                      c_i(j) = s
                  }
              }
            }

            def matMulUsingRanges (
                 a : Array[Array[Double]],
                 b : Array[Array[Double]],
                 c : Array[Array[Double]]) : Unit = {

              val jRange = 0 until b(0).length;
              val kRange = 0 until b.length;
              val iRange = 0 until a.length;

              val b_j = new Array[Double](b.length)

              for (j <- jRange) {
                  for (k <- kRange) {
                      b_j(k) = b(k)(j)
                  }
                  for (i <- iRange) {
                      val c_i = c(i);
                      val a_i = a(i);
                      var s = 0.0d;
                      for (k <- kRange) {
                          s += a_i(k) * b_j(k)
                      }
                      c_i(j) = s
                  }
              }
            }


          are much slower than the same algorithm coded with while loops:
            def matMulUsingWhileLoop (
                a : Array[Array[Double]],
                b : Array[Array[Double]],
                c : Array[Array[Double]]) : Unit = {

              val m = a.length;
              val p = b(0).length;
              val n = b.length;

              val b_j = new Array[Double](b.length);

              var i = 0; var j = 0; var k = 0;
              while (j < p) {
                  k = 0
                  while (k < n) {
                      b_j(k) = b(k)(j);
                      k += 1
                  }
                  i = 0
                  while (i < m) {
                      val c_i = c(i);
                      val a_i = a(i);
                      var s = 0.0d;
                      k = 0;
                      while (k < n) {
                          s += a_i(k) * b_j(k);
                          k += 1
                      }
                      c_i(j) = s;
                      i += 1
                  }
                  j += 1;
              }
            }

          but the while loop code is more complex and error prone.

          (Sorry, Trac appears to remove some line breaks; I
          added some explicit semis but might have missed some;
          I'll try attaching actual working source code)

          Running this while measuring time in nanoseconds:

          Iterators 2,807,815,301ns
          Ranges 2,789,958,191ns
          While Loop 190,778,574ns
          MatMul by Iterators is 14 times as slow as with while loops.

          It does not appear that the Hotspot runtime profiling and optimization dramatically helps this performance problem
          This performance problem can hurt adoption of Scala for many types of uses/applications.
          I would like to suggest the compiler optimize the common case of for loops, that is,

          {code}
          for (var <- Range [by step])
          for (var <- int to int [by step])
          for (var <- int until int [by step])
          {code}

          to use while loops under the covers instead of. Currently, nested for loops using ranges/iterators are sometimes an order of magnitude slower than while loops. However, while loop constructs for iterating over arrays are very cumbersome, and the functional style (foreach) is also cumbersome and introduces a lot of function call overhead as well.

          The following two matrix multipication implementations

          {code}
            def matMulUsingIterators (
                 a : Array[Array[Double]],
                 b : Array[Array[Double]],
                 c : Array[Array[Double]]) : Unit = {

              val b_j = new Array[Double](b.length)

              for (j <- 0 until b(0).length) {
                  for (k <- 0 until b.length) {
                      b_j(k) = b(k)(j)
                  }
                  for (i <- 0 until a.length) {
                      val c_i = c(i)
                      val a_i = a(i)
                      var s = 0.0d;
                      for (k <- 0 until b.length) {
                          s += a_i(k) * b_j(k)
                      }
                      c_i(j) = s
                  }
              }
            }

            def matMulUsingRanges (
                 a : Array[Array[Double]],
                 b : Array[Array[Double]],
                 c : Array[Array[Double]]) : Unit = {

              val jRange = 0 until b(0).length;
              val kRange = 0 until b.length;
              val iRange = 0 until a.length;

              val b_j = new Array[Double](b.length)

              for (j <- jRange) {
                  for (k <- kRange) {
                      b_j(k) = b(k)(j)
                  }
                  for (i <- iRange) {
                      val c_i = c(i);
                      val a_i = a(i);
                      var s = 0.0d;
                      for (k <- kRange) {
                          s += a_i(k) * b_j(k)
                      }
                      c_i(j) = s
                  }
              }
            }
          {code}

          are much slower than the same algorithm coded with while loops:

          {code}
            def matMulUsingWhileLoop (
                a : Array[Array[Double]],
                b : Array[Array[Double]],
                c : Array[Array[Double]]) : Unit = {

              val m = a.length;
              val p = b(0).length;
              val n = b.length;

              val b_j = new Array[Double](b.length);

              var i = 0; var j = 0; var k = 0;
              while (j < p) {
                  k = 0
                  while (k < n) {
                      b_j(k) = b(k)(j);
                      k += 1
                  }
                  i = 0
                  while (i < m) {
                      val c_i = c(i);
                      val a_i = a(i);
                      var s = 0.0d;
                      k = 0;
                      while (k < n) {
                          s += a_i(k) * b_j(k);
                          k += 1
                      }
                      c_i(j) = s;
                      i += 1
                  }
                  j += 1;
              }
            }
          {code}

          but the while loop code is more complex and error prone.

          (Sorry, Trac appears to remove some line breaks; I
          added some explicit semis but might have missed some;
          I'll try attaching actual working source code)

          Running this while measuring time in nanoseconds:

          {code}
          Iterators 2,807,815,301ns
          Ranges 2,789,958,191ns
          While Loop 190,778,574ns
          {code}

          MatMul by Iterators is 14 times as slow as with while loops.

          It does not appear that the Hotspot runtime profiling and optimization dramatically helps this performance problem
          This performance problem can hurt adoption of Scala for many types of uses/applications.
          Hide
          darkdimius Dmitry Petrashko added a comment -

          There's a different approach that I've tried in scalaBlitz: dont require users switch from using standard library while they code, but instead give a macro that changes implementation methods, replacing standard library implementation with macro-based one.
          Here's small description of it: http://scala-blitz.github.io/home/documentation//optimize.html

          The Range example in this ticket will be compiled to while loops and get same performance.

          Show
          darkdimius Dmitry Petrashko added a comment - There's a different approach that I've tried in scalaBlitz: dont require users switch from using standard library while they code, but instead give a macro that changes implementation methods, replacing standard library implementation with macro-based one. Here's small description of it: http://scala-blitz.github.io/home/documentation//optimize.html The Range example in this ticket will be compiled to while loops and get same performance.
          moors Adriaan Moors made changes -
          Workflow Open-Closed Workflow v3 [ 29883 ] Copy of Open-Closed Workflow v3 [ 40216 ]
          sethtisue Seth Tisue made changes -
          Fix Version/s Backlog [ 11701 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              djb David Biesack
              TracCC:
              Adam Kiezun, Alex Cruise, Alexey Romanov, Andrew McCallum, Anton Mellit, Carlos Lopez, Christos KK Loverdos, Daniel Sobral, Erkki Lindpere, federico silva, Ismael Juma, Johannes Rudolph, Jonathan Shore, Lachlan Deck, Miguel Garcia, Mirko Stocker, Olivier Chafik, Rüdiger Keller, Sbastien Bocq, Seth Tisue, spiros, turicum
            • Votes:
              36 Vote for this issue
              Watchers:
              31 Start watching this issue

              Dates

              • Created:
                Updated: