Optimize simple for loops #1338

scabug · 2008-09-15T15:06:04Z

I would like to suggest the compiler optimize the common case of for loops, that is,

for (var <- Range [by step])
for (var <- int to int [by step])
for (var <- int until int [by step])

to use while loops under the covers instead of. Currently, nested for loops using ranges/iterators are sometimes an order of magnitude slower than while loops. However, while loop constructs for iterating over arrays are very cumbersome, and the functional style (foreach) is also cumbersome and introduces a lot of function call overhead as well.

The following two matrix multipication implementations

  def matMulUsingIterators (
       a : Array[Array[Double]],
       b : Array[Array[Double]],
       c : Array[Array[Double]]) : Unit = {

    val b_j = new Array[Double](b.length)

    for (j <- 0 until b(0).length) {
        for (k <- 0 until b.length) {
            b_j(k) = b(k)(j)
        }
        for (i <- 0 until a.length) {
            val c_i = c(i)
            val a_i = a(i)
            var s = 0.0d;
            for (k <- 0 until b.length) {
                s += a_i(k) * b_j(k)
            }
            c_i(j) = s
        }
    }
  }

  def matMulUsingRanges (
       a : Array[Array[Double]],
       b : Array[Array[Double]],
       c : Array[Array[Double]]) : Unit = {

    val jRange = 0 until b(0).length;
    val kRange = 0 until b.length;
    val iRange = 0 until a.length;

    val b_j = new Array[Double](b.length)

    for (j <- jRange) {
        for (k <- kRange) {
            b_j(k) = b(k)(j)
        }
        for (i <- iRange) {
            val c_i = c(i);
            val a_i = a(i);
            var s = 0.0d;
            for (k <- kRange) {
                s += a_i(k) * b_j(k)
            }
            c_i(j) = s
        }
    }
  }

are much slower than the same algorithm coded with while loops:

  def matMulUsingWhileLoop (
      a : Array[Array[Double]],
      b : Array[Array[Double]],
      c : Array[Array[Double]]) : Unit = {

    val m = a.length;
    val p = b(0).length;
    val n = b.length;

    val b_j = new Array[Double](b.length);

    var i = 0; var j = 0; var k = 0;
    while (j < p) {
        k = 0
        while (k < n) {
            b_j(k) = b(k)(j);
            k += 1
        }
        i = 0
        while (i < m) {
            val c_i = c(i);
            val a_i = a(i);
            var s = 0.0d;
            k = 0;
            while (k < n) {
                s += a_i(k) * b_j(k);
                k += 1
            }
            c_i(j) = s;
            i += 1
        }
        j += 1;
    }
  }

but the while loop code is more complex and error prone.

(Sorry, Trac appears to remove some line breaks; I
added some explicit semis but might have missed some;
I'll try attaching actual working source code)

Running this while measuring time in nanoseconds:

Iterators   2,807,815,301ns
Ranges      2,789,958,191ns
While Loop  190,778,574ns

MatMul by Iterators is 14 times as slow as with while loops.

It does not appear that the Hotspot runtime profiling and optimization dramatically helps this performance problem
This performance problem can hurt adoption of Scala for many types of uses/applications.

scabug · 2008-09-15T15:06:04Z

Imported From: https://issues.scala-lang.org/browse/SI-1338?orig=1
Reporter: @DavidBiesack

scabug · 2008-09-15T15:08:54Z

@DavidBiesack said:

// Scala code to compare performance of nested int loops
object MatMul {


  def matMulUsingIterators (
       a : Array[Array[Double]],
       b : Array[Array[Double]],
       c : Array[Array[Double]]) : Unit = {

    val b_j = new Array[Double](b.length)

    for (j <- 0 until b(0).length) {
        for (k <- 0 until b.length) {
            b_j(k) = b(k)(j)
        }
        for (i <- 0 until a.length) {
            val c_i = c(i)
            val a_i = a(i)
            var s = 0.0d;
            for (k <- 0 until b.length) {
                s += a_i(k) * b_j(k)
            }
            c_i(j) = s
        }
    }
  }

  def matMulUsingRanges (
       a : Array[Array[Double]],
       b : Array[Array[Double]],
       c : Array[Array[Double]]) : Unit = {

    val jRange = 0 until b(0).length // p
    val kRange = 0 until b.length // n
    val iRange = 0 until a.length // m

    val b_j = new Array[Double](b.length)

    for (j <- jRange) {
        for (k <- kRange) {
            b_j(k) = b(k)(j)
        }
        for (i <- iRange) {
            val c_i = c(i)
            val a_i = a(i)
            var s = 0.0d;
            for (k <- kRange) {
                s += a_i(k) * b_j(k)
            }
            c_i(j) = s
        }
    }
  }

  def matMulUsingLimits (
       a : Array[Array[Double]],
       b : Array[Array[Double]],
       c : Array[Array[Double]]) : Unit = {

    val b_j = new Array[Double](b.length)

    val m = a.length
    val p = b(0).length
    val n = b.length

    for (j <- 0 until p) {
        for (k <- 0 until n) {
            b_j(k) = b(k)(j)
        }
        for (i <- 0 until m) {
            val c_i = c(i)
            val a_i = a(i)
            var s = 0.0d;
            for (k <- 0 until n) {
                s += a_i(k) * b_j(k)
            }
            c_i(j) = s
        }
    }
  }

  def matMulUsingWhileLoop (
      a : Array[Array[Double]],
      b : Array[Array[Double]],
      c : Array[Array[Double]]) : Unit = {

    val m = a.length
    val p = b(0).length
    val n = b.length

    val b_j = new Array[Double](b.length)

    var i = 0; var j = 0; var k = 0
    while (j < p) {
        k = 0
        while (k < n) {
            b_j(k) = b(k)(j)
            k += 1
        }
        i = 0
        while (i < m) {
            val c_i = c(i)
            val a_i = a(i)
            var s = 0.0d;
            k = 0
            while (k < n) {
                s += a_i(k) * b_j(k)
                k += 1
            }
            c_i(j) = s
            i += 1
        }
        j += 1
    }
  }

  def time[R](block: => R) : (Long, R) = {
    val start = System.nanoTime()
    val result : R = block
    val time = System.nanoTime() - start
    (time, result)
  }

  val format = new java.text.DecimalFormat("0,000'ns'");
  def report[R](label: String, result: (Long, R)) = {
     println(label + " " + format.format(result._1))
     }

    private val FACTOR = 5
    private val M = 80
    private val N = 70
    private val P = 60

    def main(args : Array[String]) = {
      for (trial <- 3 until 0 by -1) {
      val factor = (if (System.getProperty("factor") != null)
                      Integer.parseInt(System.getProperty("factor"))
                    else
                        FACTOR)
      val multiplier = if (trial == 1) factor else 1;
      val m = M * multiplier
      val n = N * multiplier
      val p = P * multiplier
      val a = new Array[Array[Double]](m,n)
      val b = new Array[Array[Double]](n,p)
      val c = new Array[Array[Double]](m,p)
      println("\nMultiply c[" + m + "," + p + "] = a[" + m + "," + n + "] times b[" + n + "," + p + "]\n");
      val whileTime = time(matMulUsingWhileLoop(a,b,c))
      val iterTime = time(matMulUsingIterators(a,b,c))
      report("Iterators  ", iterTime)
      report("Limits     ", time(matMulUsingLimits(a,b,c)))
      report("Ranges     ", time(matMulUsingRanges(a,b,c)))
      report("While Loop ", whileTime)
      println("MatMul by Iterators is " + iterTime._1 / whileTime._1 + " times as slow as with while loops.")
      }
  }
}

scabug · 2009-04-10T18:20:30Z

Jonathan Shore (jshore) said:
This is a very important performance enhancement for numerical work.

Seems that this could be solved with some pattern matching in the compiler, recognizing a range with no filtering (in the simplest case). Could then remap to a while form for the next stage.

scabug · 2009-04-10T18:44:56Z

@Sciss said:
+1

scabug · 2009-04-10T19:26:09Z

Kipton Barros (kbarros) said:
+1

scabug · 2009-05-15T22:22:29Z

Erkki Lindpere (villane) said:
+1

It would be nice if for-comprehensions with simple filters could be optimized as well, turning

for (i <- 1 to 10 if shouldProcess(i)) {}

into

var i = 1
while (i < 10) {
  if (shouldProcess(i)) {
  }
  i += 1
}

And extra nice if this would work with random access sequences.

scabug · 2009-10-22T18:43:53Z

Philippe (phdp) said:
+1

This is actually the only thing keeping me from using Scala.

scabug · 2009-11-30T09:47:47Z

@dragos said:
Replying to [comment:12 PhDP]:

+1

This is actually the only thing keeping me from using Scala.
Have you tried '-optimize'? It can help a lot. It's very unlikely this will move from library to compiler-generated loops.

scabug · 2009-12-20T01:02:28Z

Adam Kiezun (akiezun) said:
+1

scabug · 2009-12-26T18:55:29Z

Miguel Garcia (mgarcia) said:

I haven't benchmarked (with and without -optimize) to see whether the current compilation scheme for "simple loops" is good enough.

But in case it isn't, looks like the single place to change in the compiler is method TreeBuilder.makeFor().

According to its comment,

http://lampsvn.epfl.ch/trac/scala/browser/scala/trunk/src/compiler/scala/tools/nsc/ast/parser/TreeBuilder.scala

it performs five transformations. Prepending as special case a transformation for "simple loops" would not change semantics. (Well, assuming that a local definition does not shadow the usual ones: "to" in "1 to 10", "Range", and so on)

Miguel
http://www.sts.tu-harburg.de/people/mi.garcia/

scabug · 2010-07-05T23:50:03Z

Anton Mellit (mellit) said:
I tried to create a script with the following:

def timeit(f : () => Unit) {
    val t1 = System.currentTimeMillis()
    f()
    val t2 = System.currentTimeMillis()

    println(t2-t1)
}    

def repeat(n : Int, f : Int => Unit) : Unit = {
    var i = 0
    while (i<n) {
        f(i)
        i += 1
    }
}

def test0() {
    var i = 0
    var sum = 0
    while (i < 1000000000) {
        sum += i
        i += 1
    }
    println(sum)
}

def test1() {
    var sum = 0
    repeat(1000000000, i => {
        sum += i
    })
    println(sum)
}

def test2() {
    var sum = 0
    for(i <- 0 until 1000000000) {
        sum += i
    }
    println(sum)
}

timeit(test0)
timeit(test1)
timeit(test2)

Result is:

-1243309312
467
-1243309312
504
-1243309312
11899

May be this 'repeat' is a workaround? Warning: works only with 'scala -optimise'. This is not very stable, sometimes some seemingly minor modifications, i.e. moving the code outside of the function, break it and I get 12000 for 'repeat'.

scabug · 2010-07-06T11:22:58Z

@DavidBiesack said:
Replying to [comment:28 mellit]:

I tried to create a script with the following:

{code}
...
def repeat(n : Int, f : Int => Unit) : Unit = {
var i = 0
while (i<n) {
f(i)
i += 1
}
}
}}

Thanks for the suggestion. I hesitate to try or rely on this for several
reasons:

having to remember to not use the standard for loop is problematic. The syntax is different and not based on generators and less general, although one could certainly patch it to be more general; i.e pass in start and end and optional increment values: repeat(start:Int, end:Int, increment:Int) or repeat(range:Range) etc. (and also handle backwards iteration when called for).
This may still involve at least an extra function call per iteration. If the compiler has to inject other synthetic calls into the body to allow access to other lexically scoped variables, this may also affect performance. The goal of this optimization is to achieve Java-level performance of for loops where possible.
most critically, if/when the Scala compiler implements this ticket's optimization, then all code using this repeat control would not get the optimization, requiring code maintenance to undo it.

scabug · 2010-09-18T18:45:50Z

@paulp said:
Please see update on this ticket sent to scala-user, also available here: http://scala-programming-language.1934581.n4.nabble.com/optimizing-simple-fors-td2545502.html#a2545502

I very much agree with mgarcia's comment of nine months ago that TreeBuilder.makeFor already does a whole pile of tree transformations and there is no convincing reason we shouldn't add one which has this kind of impact. Failing agreement on that point, I believe we have a pressing responsibility to clean up the parsing phase and plugin architecture sufficiently that it would be possible to do this transformation with a compiler plugin.

scabug · 2010-10-03T23:08:47Z

Olivier Chafik (olivier.chafik) said:
Hello,

I've just written such a compiler plugin, which you can install straight away on 2.8.0 with sbaz for testing :
http://article.gmane.org/gmane.comp.lang.scala.user/31814

It's probably full of bugs, but it rewrites int-range for loops with no filters and Array[T].foreach, Array[T].map into equivalent while loops.

You can have a look at the auto tests to see the supported cases :
http://code.google.com/p/nativelibs4java/source/browse/branches/OpenCL-BridJ/libraries/OpenCL/ScalaCLPlugin/src/test/scala/scalacl/

Looking forward to seeing something like this mainstream :-)
Cheers

zOlive

scabug · 2010-10-04T10:33:18Z

Olivier Chafik (olivier.chafik) said:
I've adapted the fannkuch and nbody benchmarks that were in the scala-user thread mentioned previously and I had to adapt it a bit (inlining the ranges that were stored as val range = x until y).

Here's the modified code :
http://nativelibs4java.sourceforge.net/sbaz/scalacl/examples/

And to run it (with ScalaCL plugin installed via sbaz: sbaz install scalacl-compiler-plugin) :
DISABLE_SCALACL_PLUGIN=1 scalac fannkuch.scala && scala fannkuch
scalac fannkuch.scala && scala fannkuch

DISABLE_SCALACL_PLUGIN=1 scalac nbody.scala && scala nbody
scalac nbody.scala && scala nbody

With the plugin turned on, the performance of the three variants (While, Limit, Range) is the same (the first while is actually slower, I haven't investigated why).

scabug · 2010-10-06T00:45:52Z

Olivier Chafik (olivier.chafik) said:
(Sorry for spamming you again, this should be the last time)

I've just enhanced the plugin with more conversions to while loops :

Array.foldLeft / foldRight
Array.reduceLeft / reduceRight
Array.scanLeft / scanRight
(in addition to Array.foreach, Array.map and the int range foreach with no filter)

Also, the conversions should now work on method references and inline lambdas the same way.

Further progress and plans can be tracked at the bottom of this page :
http://code.google.com/p/scalacl/wiki/ScalaCLPlugin

scabug · 2012-11-15T23:10:53Z

Herrmann Khosse (herrkhosse) said:
+1

scabug · 2013-07-21T16:36:15Z

@SethTisue said:
"Spire also provides a loop macro called cfor whose syntax bears a slight resemblance to a traditional for-loop from C or Java. This macro expands to a tail-recursive function, which will inline literal function arguments." https://github.com/non/spire

scabug · 2013-07-21T19:19:03Z

Olivier Chafik (ochafik) said:
Here's another loop macro with an arguably better syntax: Scalaxy/Loops (which reuses code from ScalaCL):

https://github.com/ochafik/Scalaxy/tree/master/Loops

scabug · 2014-03-28T17:01:52Z

@DarkDimius said:
There's a different approach that I've tried in scalaBlitz: dont require users switch from using standard library while they code, but instead give a macro that changes implementation methods, replacing standard library implementation with macro-based one.
Here's small description of it: http://scala-blitz.github.io/home/documentation//optimize.html

The Range example in this ticket will be compiled to while loops and get same performance.

SethTisue · 2019-06-11T10:52:56Z

@lrytz do you agree we should now close this, as per scala/scala#8069 ?

and if we do close it, let's make some noise about it, too!

ashawley · 2019-06-11T16:43:55Z

It would seem the noise making probably could have started 6 years ago back when 2.10 was released.

I copied the benchmark from scala/scala#8069 out of the Scala project to a fresh sbt 0.13 project and ran it against 2.10.7, 2.11.12, 2.12.8 and 2.13.0. My laptop (4 year old Macbook Air) is a little quieter than when I initially did this, so the variance is a little less with these numbers.

2.10.7

Benchmark             (size)  Mode  Cnt       Score      Error  Units
matMulUsingIterators      10  avgt   20        2.32 ±     0.06  us/op
matMulUsingRanges         10  avgt   20        2.46 ±     0.06  us/op
matMulUsingWhileLoop      10  avgt   20        1.17 ±     0.33  us/op
matMulUsingIterators     100  avgt   20     1294.35 ±    26.09  us/op
matMulUsingRanges        100  avgt   20     1547.63 ±    36.67  us/op
matMulUsingWhileLoop     100  avgt   20     1101.28 ±    56.62  us/op
matMulUsingIterators    1000  avgt   20  1450647.85 ± 50986.23  us/op
matMulUsingRanges       1000  avgt   20  1609264.80 ± 37524.73  us/op
matMulUsingWhileLoop    1000  avgt   20  1442805.40 ± 24099.38  us/op

2.11.12

Benchmark             (size)  Mode  Cnt       Score      Error  Units
matMulUsingIterators      10  avgt   20        2.35 ±     0.08  us/op
matMulUsingRanges         10  avgt   20        2.51 ±     0.06  us/op
matMulUsingWhileLoop      10  avgt   20        1.04 ±     0.03  us/op
matMulUsingIterators     100  avgt   20     1297.23 ±    30.70  us/op
matMulUsingRanges        100  avgt   20     1632.09 ±    49.61  us/op
matMulUsingWhileLoop     100  avgt   20     1064.69 ±    25.42  us/op
matMulUsingRanges       1000  avgt   20  1635161.15 ± 39404.86  us/op
matMulUsingIterators    1000  avgt   20  1482990.65 ± 42431.02  us/op
matMulUsingWhileLoop    1000  avgt   20  1438909.85 ± 20257.01  us/op

2.12.8

Benchmark             (size)  Mode  Cnt       Score      Error  Units
matMulUsingIterators      10  avgt   20        2.02 ±     0.06  us/op
matMulUsingRanges         10  avgt   20        2.45 ±     0.06  us/op
matMulUsingWhileLoop      10  avgt   20        0.90 ±     0.02  us/op
matMulUsingIterators     100  avgt   20     1275.96 ±    29.09  us/op
matMulUsingRanges        100  avgt   20     1581.64 ±    50.82  us/op
matMulUsingWhileLoop     100  avgt   20     1081.20 ±    21.98  us/op
matMulUsingIterators    1000  avgt   20  1465832.44 ± 29526.00  us/op
matMulUsingRanges       1000  avgt   20  1631038.32 ± 45846.02  us/op
matMulUsingWhileLoop    1000  avgt   20  1419277.53 ± 15006.24  us/op

2.13.0

Benchmark             (size)  Mode  Cnt       Score      Error  Units
matMulUsingIterators      10  avgt   20        6.64 ±     0.23  us/op
matMulUsingRanges         10  avgt   20        2.98 ±     0.08  us/op
matMulUsingWhileLoop      10  avgt   20        0.92 ±     0.02  us/op
matMulUsingIterators     100  avgt   20     2679.84 ±    50.97  us/op
matMulUsingRanges        100  avgt   20     2045.60 ±    46.93  us/op
matMulUsingWhileLoop     100  avgt   20     1088.56 ±    29.83  us/op
matMulUsingIterators    1000  avgt   20  2391493.44 ± 44324.00  us/op
matMulUsingRanges       1000  avgt   20  2114802.81 ± 48825.73  us/op
matMulUsingWhileLoop    1000  avgt   20  1435843.18 ± 20688.23  us/op

Ichoran · 2019-06-11T20:29:50Z

@SethTisue - It's not as crazily bad as in the bug report, but since it seems to have actually gotten worse in 2.13 in the benchmark that @ashawley presented, and in any case it's still slow, I wouldn't make any noise.

In fact, it seems likely that there's been a ~50% performance regression in 2.13 :(

lrytz · 2019-06-27T12:27:00Z

I copied the benchmark from scala/scala#8069 out of the Scala project to a fresh sbt 0.13 project and ran it against 2.10.7, 2.11.12, 2.12.8 and 2.13.0.

@ashawley do you have that repository at hand? I'd like to reproduce your test.

I took your PR (scala/scala#8069), rebased it on current 2.13.x, and ran bench/jmh:run scala.ArrayBenchmark in sbt. What I got looks pretty good (on my MacBook Pro, and in summer heat, jdk 1.8.0_181):

[info] ArrayBenchmark.matMulUsingWhileLoopBench          10  avgt   20        0.603 ±      0.022  us/op
[info] ArrayBenchmark.matMulUsingIteratorsBench          10  avgt   20        0.851 ±      0.041  us/op
[info] ArrayBenchmark.matMulUsingRangesBench             10  avgt   20        0.946 ±      0.019  us/op
[info] ArrayBenchmark.matMulUsingWhileLoopBench         100  avgt   20      750.031 ±      7.678  us/op
[info] ArrayBenchmark.matMulUsingIteratorsBench         100  avgt   20      861.454 ±     26.186  us/op
[info] ArrayBenchmark.matMulUsingRangesBench            100  avgt   20     1172.940 ±     43.039  us/op
[info] ArrayBenchmark.matMulUsingWhileLoopBench        1000  avgt   20  1025865.092 ±   9769.750  us/op
[info] ArrayBenchmark.matMulUsingIteratorsBench        1000  avgt   20  1141386.009 ±  44863.893  us/op
[info] ArrayBenchmark.matMulUsingRangesBench           1000  avgt   20  1202351.806 ± 112234.083  us/op

Note that the bench project in the scala/scala repo enables the inliner/optimizer (https://github.com/scala/scala/blob/v2.13.0/build.sbt#L634).

ashawley · 2019-06-27T12:58:28Z

Here's a repo that can run the benchmark across Scala versions: https://github.com/ashawley/bug-1338

lrytz · 2019-06-28T09:07:48Z

Thanks, @ashawley!

I ran the benchmark on 2.12.8 and 2.13.0, on AdoptOpenJDK 1.8.212-04, on my MacBook Pro. For me the performance is very similar, maybe with a slight advantage for 2.13

2.12.8

[info] ArrayBenchmark.matMulUsingWhileLoopBench          10  avgt   20        0.608 ±      0.059  us/op
[info] ArrayBenchmark.matMulUsingIteratorsBench          10  avgt   20        4.642 ±      0.090  us/op
[info] ArrayBenchmark.matMulUsingRangesBench             10  avgt   20        1.815 ±      0.203  us/op
[info] ArrayBenchmark.matMulUsingWhileLoopBench         100  avgt   20      789.694 ±     35.496  us/op
[info] ArrayBenchmark.matMulUsingIteratorsBench         100  avgt   20     1665.028 ±    100.906  us/op
[info] ArrayBenchmark.matMulUsingRangesBench            100  avgt   20     1218.045 ±    104.188  us/op
[info] ArrayBenchmark.matMulUsingWhileLoopBench        1000  avgt   20  1178862.700 ± 153543.215  us/op
[info] ArrayBenchmark.matMulUsingIteratorsBench        1000  avgt   20  1482162.761 ± 198875.330  us/op
[info] ArrayBenchmark.matMulUsingRangesBench           1000  avgt   20  1464256.812 ± 236343.203  us/op

2.13.0

[info] ArrayBenchmark.matMulUsingWhileLoopBench          10  avgt   20        0.555 ±      0.003  us/op
[info] ArrayBenchmark.matMulUsingIteratorsBench          10  avgt   20        4.095 ±      0.460  us/op
[info] ArrayBenchmark.matMulUsingRangesBench             10  avgt   20        1.656 ±      0.027  us/op
[info] ArrayBenchmark.matMulUsingWhileLoopBench         100  avgt   20      774.364 ±     41.958  us/op
[info] ArrayBenchmark.matMulUsingIteratorsBench         100  avgt   20     1563.868 ±     36.727  us/op
[info] ArrayBenchmark.matMulUsingRangesBench            100  avgt   20     1297.571 ±    253.275  us/op
[info] ArrayBenchmark.matMulUsingWhileLoopBench        1000  avgt   20  1090820.285 ±  27666.503  us/op
[info] ArrayBenchmark.matMulUsingIteratorsBench        1000  avgt   20  1552164.276 ± 220824.486  us/op
[info] ArrayBenchmark.matMulUsingRangesBench           1000  avgt   20  1168316.835 ±  21196.048  us/op

I also ran it on 2.13 with the optimizer enabled (> set scalacOptions ++= Seq("-feature", "-opt:l:inline", "-opt-inline-from:**"), which then the Iterator and RangeForeach versions perform the same as the while loop

2.13.0, optimized

[info] ArrayBenchmark.matMulUsingWhileLoopBench          10  avgt   20        0.579 ±     0.047  us/op
[info] ArrayBenchmark.matMulUsingIteratorsBench          10  avgt   20        0.770 ±     0.053  us/op
[info] ArrayBenchmark.matMulUsingRangesBench             10  avgt   20        0.743 ±     0.016  us/op
[info] ArrayBenchmark.matMulUsingWhileLoopBench         100  avgt   20      777.171 ±    33.312  us/op
[info] ArrayBenchmark.matMulUsingIteratorsBench         100  avgt   20      900.683 ±   122.898  us/op
[info] ArrayBenchmark.matMulUsingRangesBench            100  avgt   20     1018.408 ±    39.549  us/op
[info] ArrayBenchmark.matMulUsingWhileLoopBench        1000  avgt   20  1108987.279 ± 53725.845  us/op
[info] ArrayBenchmark.matMulUsingIteratorsBench        1000  avgt   20  1053506.326 ± 36964.233  us/op
[info] ArrayBenchmark.matMulUsingRangesBench           1000  avgt   20  1075920.539 ± 25893.733  us/op

Maybe you can give it another try? Or someone else?

Out of curiosity, I ran the 2.13 version, without the scala optimzier, on Graal EE

[info] ArrayBenchmark.matMulUsingWhileLoopBench          10  avgt   20        0.580 ±      0.019  us/op
[info] ArrayBenchmark.matMulUsingIteratorsBench          10  avgt   20        0.864 ±      0.014  us/op
[info] ArrayBenchmark.matMulUsingRangesBench             10  avgt   20        0.864 ±      0.022  us/op
[info] ArrayBenchmark.matMulUsingWhileLoopBench         100  avgt   20      882.833 ±     53.340  us/op
[info] ArrayBenchmark.matMulUsingIteratorsBench         100  avgt   20     1004.923 ±     63.560  us/op
[info] ArrayBenchmark.matMulUsingRangesBench            100  avgt   20     1061.473 ±    136.659  us/op
[info] ArrayBenchmark.matMulUsingWhileLoopBench        1000  avgt   20  1191383.633 ± 141781.733  us/op
[info] ArrayBenchmark.matMulUsingIteratorsBench        1000  avgt   20  1135731.997 ± 101049.751  us/op
[info] ArrayBenchmark.matMulUsingRangesBench           1000  avgt   20  1228080.904 ± 167866.786  us/op

So Graal achieves almost the same as the Scala optimizer. This brings the discussion to the VM. The original bug report is > 10 years old, I guess it measured Scala 2.7 on Java 1.5 or so. Things have improved in the meantime, the same 2.7 code might run a lot faster on a current JVM.

It's also unclear if the measurements in the orignal bug report were done after proper warm-up.

One problem with micro-benchmarks like these is that they don't test for what's called "profile pollution". In a real application, Range.foreach is potentially invoked a large number of times, which makes the lambda invocation in its body megamorphic. It looks like the benchmark here has enough Range.foreach calls. Adding a few more Range.foreach calls with different lambdas to the benchmark doesn't seem to affect the relative performance between the while and the foreach version, so that's a good sign.

So I'd say we can close this bug.

ashawley · 2019-06-28T15:12:46Z

Thanks for confirming the benchmark and sharing your insights, @lrytz. I think this can be closed. I also predict 2.13 is fine. The inconsistent results it gave can be chalked up to benchmark variance, IMO.

lrytz · 2019-07-01T09:53:27Z

👍 Thanks for pulling on that thread, @ashawley!

scabug added the enhancement label Apr 6, 2017

scabug added this to the Backlog milestone Apr 6, 2017

SethTisue added the performance label Apr 17, 2018

jozefbakus mentioned this issue Feb 13, 2019

Feature 148 comparison job reporting nested differences AbsaOSS/enceladus#163

Merged

ashawley mentioned this issue May 17, 2019

Add benchmark of for-loops scala/scala#8069

Closed

SethTisue mentioned this issue Jun 11, 2019

For loops with -optimize have much worse performance than while loops #4633

Closed

lrytz closed this as completed Jul 1, 2019

SethTisue modified the milestones: Backlog, 2.12.8 Jul 3, 2019

szymonm mentioned this issue Jul 31, 2020

Upgrade to Scala 2.12 filodb/FiloDB#457

Open

jpsacha mentioned this issue Feb 6, 2021

Benchmark conversion of color and multichannel images ij-plugins/ijp-javacv#3

Closed

augi mentioned this issue Jul 14, 2021

feat: parsing and printing optimized avast/scala-hashes#26

Merged

SethTisue mentioned this issue Jun 20, 2022

automatically optimise loops over Range lampepfl/dotty-feature-requests#308

Closed

ekrich mentioned this issue Jan 13, 2024

Fix #3657: Remove a number of java.net defects scala-native/scala-native#3666

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize simple for loops #1338

Optimize simple for loops #1338

scabug commented Sep 15, 2008

scabug commented Sep 15, 2008

scabug commented Sep 15, 2008

scabug commented Apr 10, 2009

scabug commented Apr 10, 2009

scabug commented Apr 10, 2009

scabug commented May 15, 2009

scabug commented Oct 22, 2009

scabug commented Nov 30, 2009

scabug commented Dec 20, 2009

scabug commented Dec 26, 2009

scabug commented Jul 5, 2010

scabug commented Jul 6, 2010

scabug commented Sep 18, 2010

scabug commented Oct 3, 2010

scabug commented Oct 4, 2010

scabug commented Oct 6, 2010

scabug commented Nov 15, 2012

scabug commented Jul 21, 2013

scabug commented Jul 21, 2013

scabug commented Mar 28, 2014

SethTisue commented Jun 11, 2019 •

edited

ashawley commented Jun 11, 2019 •

edited

Ichoran commented Jun 11, 2019

lrytz commented Jun 27, 2019

ashawley commented Jun 27, 2019

lrytz commented Jun 28, 2019

ashawley commented Jun 28, 2019

lrytz commented Jul 1, 2019

Optimize simple for loops #1338

Optimize simple for loops #1338

Comments

scabug commented Sep 15, 2008

scabug commented Sep 15, 2008

scabug commented Sep 15, 2008

scabug commented Apr 10, 2009

scabug commented Apr 10, 2009

scabug commented Apr 10, 2009

scabug commented May 15, 2009

scabug commented Oct 22, 2009

scabug commented Nov 30, 2009

scabug commented Dec 20, 2009

scabug commented Dec 26, 2009

scabug commented Jul 5, 2010

scabug commented Jul 6, 2010

scabug commented Sep 18, 2010

scabug commented Oct 3, 2010

Looking forward to seeing something like this mainstream :-) Cheers

scabug commented Oct 4, 2010

scabug commented Oct 6, 2010

scabug commented Nov 15, 2012

scabug commented Jul 21, 2013

scabug commented Jul 21, 2013

scabug commented Mar 28, 2014

SethTisue commented Jun 11, 2019 • edited

ashawley commented Jun 11, 2019 • edited

Ichoran commented Jun 11, 2019

lrytz commented Jun 27, 2019

ashawley commented Jun 27, 2019

lrytz commented Jun 28, 2019

ashawley commented Jun 28, 2019

lrytz commented Jul 1, 2019

Looking forward to seeing something like this mainstream :-)
Cheers

SethTisue commented Jun 11, 2019 •

edited

ashawley commented Jun 11, 2019 •

edited