New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize simple for loops #1338
Comments
Imported From: https://issues.scala-lang.org/browse/SI-1338?orig=1 |
@DavidBiesack said: // Scala code to compare performance of nested int loops
object MatMul {
def matMulUsingIterators (
a : Array[Array[Double]],
b : Array[Array[Double]],
c : Array[Array[Double]]) : Unit = {
val b_j = new Array[Double](b.length)
for (j <- 0 until b(0).length) {
for (k <- 0 until b.length) {
b_j(k) = b(k)(j)
}
for (i <- 0 until a.length) {
val c_i = c(i)
val a_i = a(i)
var s = 0.0d;
for (k <- 0 until b.length) {
s += a_i(k) * b_j(k)
}
c_i(j) = s
}
}
}
def matMulUsingRanges (
a : Array[Array[Double]],
b : Array[Array[Double]],
c : Array[Array[Double]]) : Unit = {
val jRange = 0 until b(0).length // p
val kRange = 0 until b.length // n
val iRange = 0 until a.length // m
val b_j = new Array[Double](b.length)
for (j <- jRange) {
for (k <- kRange) {
b_j(k) = b(k)(j)
}
for (i <- iRange) {
val c_i = c(i)
val a_i = a(i)
var s = 0.0d;
for (k <- kRange) {
s += a_i(k) * b_j(k)
}
c_i(j) = s
}
}
}
def matMulUsingLimits (
a : Array[Array[Double]],
b : Array[Array[Double]],
c : Array[Array[Double]]) : Unit = {
val b_j = new Array[Double](b.length)
val m = a.length
val p = b(0).length
val n = b.length
for (j <- 0 until p) {
for (k <- 0 until n) {
b_j(k) = b(k)(j)
}
for (i <- 0 until m) {
val c_i = c(i)
val a_i = a(i)
var s = 0.0d;
for (k <- 0 until n) {
s += a_i(k) * b_j(k)
}
c_i(j) = s
}
}
}
def matMulUsingWhileLoop (
a : Array[Array[Double]],
b : Array[Array[Double]],
c : Array[Array[Double]]) : Unit = {
val m = a.length
val p = b(0).length
val n = b.length
val b_j = new Array[Double](b.length)
var i = 0; var j = 0; var k = 0
while (j < p) {
k = 0
while (k < n) {
b_j(k) = b(k)(j)
k += 1
}
i = 0
while (i < m) {
val c_i = c(i)
val a_i = a(i)
var s = 0.0d;
k = 0
while (k < n) {
s += a_i(k) * b_j(k)
k += 1
}
c_i(j) = s
i += 1
}
j += 1
}
}
def time[R](block: => R) : (Long, R) = {
val start = System.nanoTime()
val result : R = block
val time = System.nanoTime() - start
(time, result)
}
val format = new java.text.DecimalFormat("0,000'ns'");
def report[R](label: String, result: (Long, R)) = {
println(label + " " + format.format(result._1))
}
private val FACTOR = 5
private val M = 80
private val N = 70
private val P = 60
def main(args : Array[String]) = {
for (trial <- 3 until 0 by -1) {
val factor = (if (System.getProperty("factor") != null)
Integer.parseInt(System.getProperty("factor"))
else
FACTOR)
val multiplier = if (trial == 1) factor else 1;
val m = M * multiplier
val n = N * multiplier
val p = P * multiplier
val a = new Array[Array[Double]](m,n)
val b = new Array[Array[Double]](n,p)
val c = new Array[Array[Double]](m,p)
println("\nMultiply c[" + m + "," + p + "] = a[" + m + "," + n + "] times b[" + n + "," + p + "]\n");
val whileTime = time(matMulUsingWhileLoop(a,b,c))
val iterTime = time(matMulUsingIterators(a,b,c))
report("Iterators ", iterTime)
report("Limits ", time(matMulUsingLimits(a,b,c)))
report("Ranges ", time(matMulUsingRanges(a,b,c)))
report("While Loop ", whileTime)
println("MatMul by Iterators is " + iterTime._1 / whileTime._1 + " times as slow as with while loops.")
}
}
}
|
Jonathan Shore (jshore) said: Seems that this could be solved with some pattern matching in the compiler, recognizing a range with no filtering (in the simplest case). Could then remap to a while form for the next stage. |
@Sciss said: |
Kipton Barros (kbarros) said: |
Erkki Lindpere (villane) said: It would be nice if for-comprehensions with simple filters could be optimized as well, turning for (i <- 1 to 10 if shouldProcess(i)) {} into var i = 1 while (i < 10) { if (shouldProcess(i)) { } i += 1 } And extra nice if this would work with random access sequences. |
Philippe (phdp) said: This is actually the only thing keeping me from using Scala. |
@dragos said:
|
Adam Kiezun (akiezun) said: |
Miguel Garcia (mgarcia) said: I haven't benchmarked (with and without -optimize) to see whether the current compilation scheme for "simple loops" is good enough. But in case it isn't, looks like the single place to change in the compiler is method TreeBuilder.makeFor(). According to its comment, it performs five transformations. Prepending as special case a transformation for "simple loops" would not change semantics. (Well, assuming that a local definition does not shadow the usual ones: "to" in "1 to 10", "Range", and so on) |
Anton Mellit (mellit) said: def timeit(f : () => Unit) {
val t1 = System.currentTimeMillis()
f()
val t2 = System.currentTimeMillis()
println(t2-t1)
}
def repeat(n : Int, f : Int => Unit) : Unit = {
var i = 0
while (i<n) {
f(i)
i += 1
}
}
def test0() {
var i = 0
var sum = 0
while (i < 1000000000) {
sum += i
i += 1
}
println(sum)
}
def test1() {
var sum = 0
repeat(1000000000, i => {
sum += i
})
println(sum)
}
def test2() {
var sum = 0
for(i <- 0 until 1000000000) {
sum += i
}
println(sum)
}
timeit(test0)
timeit(test1)
timeit(test2) Result is: -1243309312
467
-1243309312
504
-1243309312
11899 May be this 'repeat' is a workaround? Warning: works only with 'scala -optimise'. This is not very stable, sometimes some seemingly minor modifications, i.e. moving the code outside of the function, break it and I get 12000 for 'repeat'. |
@DavidBiesack said:
Thanks for the suggestion. I hesitate to try or rely on this for several
|
@paulp said: I very much agree with mgarcia's comment of nine months ago that TreeBuilder.makeFor already does a whole pile of tree transformations and there is no convincing reason we shouldn't add one which has this kind of impact. Failing agreement on that point, I believe we have a pressing responsibility to clean up the parsing phase and plugin architecture sufficiently that it would be possible to do this transformation with a compiler plugin. |
Olivier Chafik (olivier.chafik) said: I've just written such a compiler plugin, which you can install straight away on 2.8.0 with sbaz for testing : It's probably full of bugs, but it rewrites int-range for loops with no filters and Array[T].foreach, Array[T].map into equivalent while loops. You can have a look at the auto tests to see the supported cases : Looking forward to seeing something like this mainstream :-)
|
Olivier Chafik (olivier.chafik) said: Here's the modified code : And to run it (with ScalaCL plugin installed via sbaz: sbaz install scalacl-compiler-plugin) : DISABLE_SCALACL_PLUGIN=1 scalac nbody.scala && scala nbody With the plugin turned on, the performance of the three variants (While, Limit, Range) is the same (the first while is actually slower, I haven't investigated why). |
Olivier Chafik (olivier.chafik) said: I've just enhanced the plugin with more conversions to while loops :
Also, the conversions should now work on method references and inline lambdas the same way. Further progress and plans can be tracked at the bottom of this page : |
Herrmann Khosse (herrkhosse) said: |
@SethTisue said: |
Olivier Chafik (ochafik) said: |
@DarkDimius said: The Range example in this ticket will be compiled to while loops and get same performance. |
@lrytz do you agree we should now close this, as per scala/scala#8069 ? and if we do close it, let's make some noise about it, too! |
It would seem the noise making probably could have started 6 years ago back when 2.10 was released. I copied the benchmark from scala/scala#8069 out of the Scala project to a fresh sbt 0.13 project and ran it against 2.10.7, 2.11.12, 2.12.8 and 2.13.0. My laptop (4 year old Macbook Air) is a little quieter than when I initially did this, so the variance is a little less with these numbers. 2.10.7
2.11.12
2.12.8
2.13.0
|
@SethTisue - It's not as crazily bad as in the bug report, but since it seems to have actually gotten worse in 2.13 in the benchmark that @ashawley presented, and in any case it's still slow, I wouldn't make any noise. In fact, it seems likely that there's been a ~50% performance regression in 2.13 :( |
@ashawley do you have that repository at hand? I'd like to reproduce your test. I took your PR (scala/scala#8069), rebased it on current 2.13.x, and ran
Note that the |
Here's a repo that can run the benchmark across Scala versions: https://github.com/ashawley/bug-1338 |
Thanks, @ashawley! I ran the benchmark on 2.12.8 and 2.13.0, on AdoptOpenJDK 1.8.212-04, on my MacBook Pro. For me the performance is very similar, maybe with a slight advantage for 2.13
I also ran it on 2.13 with the optimizer enabled (
Maybe you can give it another try? Or someone else? Out of curiosity, I ran the 2.13 version, without the scala optimzier, on Graal EE
So Graal achieves almost the same as the Scala optimizer. This brings the discussion to the VM. The original bug report is > 10 years old, I guess it measured Scala 2.7 on Java 1.5 or so. Things have improved in the meantime, the same 2.7 code might run a lot faster on a current JVM. It's also unclear if the measurements in the orignal bug report were done after proper warm-up. One problem with micro-benchmarks like these is that they don't test for what's called "profile pollution". In a real application, So I'd say we can close this bug. |
Thanks for confirming the benchmark and sharing your insights, @lrytz. I think this can be closed. I also predict 2.13 is fine. The inconsistent results it gave can be chalked up to benchmark variance, IMO. |
👍 Thanks for pulling on that thread, @ashawley! |
I would like to suggest the compiler optimize the common case of for loops, that is,
to use while loops under the covers instead of. Currently, nested for loops using ranges/iterators are sometimes an order of magnitude slower than while loops. However, while loop constructs for iterating over arrays are very cumbersome, and the functional style (foreach) is also cumbersome and introduces a lot of function call overhead as well.
The following two matrix multipication implementations
are much slower than the same algorithm coded with while loops:
but the while loop code is more complex and error prone.
(Sorry, Trac appears to remove some line breaks; I
added some explicit semis but might have missed some;
I'll try attaching actual working source code)
Running this while measuring time in nanoseconds:
MatMul by Iterators is 14 times as slow as with while loops.
It does not appear that the Hotspot runtime profiling and optimization dramatically helps this performance problem
This performance problem can hurt adoption of Scala for many types of uses/applications.
The text was updated successfully, but these errors were encountered: