Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Major performance bottleneck in scala.collection.mutable.Builder #9823

Closed
scabug opened this issue Jun 17, 2016 · 12 comments
Closed

Major performance bottleneck in scala.collection.mutable.Builder #9823

scabug opened this issue Jun 17, 2016 · 12 comments

Comments

@scabug
Copy link

scabug commented Jun 17, 2016

Background:

While performance profiling a genome sequencing pipeline over a Spark cluster we noticed that the hot methods for the pipeline came from Scala collection APIs; specifically the sizeHint methods in scala.collection.mutable.Builder trait:

  // [https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/mutable/Builder.scala#L77]
  def sizeHint(coll: TraversableLike[_, _]) {
    if (coll.isInstanceOf[collection.IndexedSeqLike[_,_]]) {
      sizeHint(coll.size)
    }
  }
  …

A further analysis of the JITed code shows that a considerably high number of CPU cycles being spent on the instance of check; thereby introducing a bottleneck. The problem also gets exacerbated for single executors with multiple threads than for multiple executors with less number of threads. For example, running the critical stage of our pipeline with 4 executors (9 cores each) is 3x faster than running it with 1 executor (with 36 cores.) Also, important to mention here that the problem gets worse on muti-socket systems with many cores.

Root cause:

The actual root cause of the problem is a slowdown in Java’s instanceof operator. More specifically, a cache trashing of the one-element secondary super type cache that Hotspot uses to cache the last observed secondary super type (interfaces and classes with a class hierarchy depth greater than 7)

// [http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee27509/src/share/vm/oops/klass.hpp]
class Klass : public Metadata { 
  …
  // Cache of last observed secondary supertype
  Klass*      _secondary_super_cache;
  
  // Array of all secondary supertypes
  Array<Klass*>* _secondary_supers; 
  …
} 

Resolution-1:

Since the root cause of the problem is Hotspot’s implementation of secondary super type check, a permanent solution would require modifying the one-element caching. In our case, we took a rather heavy-handed approach and disabled this one-element cache and let the secondary sub type check relay exclusively on scanning _secondary_supers. This hack, which does not require any modification from Scala, removed the coll.isInstanceOf[collection.IndexedSeqLike[_,_]] bottleneck and speeded up the execution of the critical stage of our pipeline by a factor of 3x (for a Spark configuration with 1 executor and 36 cores – resulting in the same performance we got previously by launching multiple executors with less number of threads).

This resolution requires modifying Hotspot source files and would take more time to upstream to JDK.

Resolution-2:

The second resolution is to completely avoid the instance of check in the sizeHint methods all together and use polymorphism instead. The simplest check we tried is to comment out the instance of check and just let every collection, even collections with “expensive” size method, use the hint. This removed the bottleneck and speeded up the critical stage of our pipeline by 2.8x factor.

The other option we tried, which actually uses polymorphism, is to override the sizeHint methods in collections of type IndexedSeqLike and avoid the instance of check (and a call to the Builder.sizeHint methods). This also resulted in 2.8x performance gain. You may also think of other elegant solutions.

While browsing the Scala source, we observed other instances where you use the same code pattern – if_instance_of_this_do_this_else_do_this – which may end up having the same problem. We recommend you to review those as well.

How to repeat the behavior:

The problem is not limited to our genome workload. For example, if you test out the example code below with multiple threads, it will bottleneck on the same instance of check at scala.collection.mutable.Builder. In our sample ran with 36 threads, applying either one of the fixes gave ~15x speed up in execution time (compared to the default.)

// Running environment is critical; on 2 socket system with 18 cores/socket, HT disabled
// Use -XX:-TieredCompilation
class Task extends Runnable {
  val c1 = Array(-1, 1)
  val c2 = Array(-2, 2)
  def run() {
    for (i <- 1 to 200000000) {
      c1.zip(c2).map(_._1).toSeq.zip(1 until 3).map(_._1)
    }
  }
}

object ScalaInstanceof {
  def main(args: Array[String]) {
    val numThreads = args(0).toInt
    val es = java.util.concurrent.Executors.newFixedThreadPool(numThreads)
    for (i <- 1 to numThreads) es.submit(new Task)
    es.shutdown()
  }
}

If you need any additional information or further details, feel free to contact me [mailto:mulugeta.mammo@intel.com]. Also, let me know if you want me to do a PR on this.

Thanks,
Mulugeta

@scabug
Copy link
Author

scabug commented Jun 17, 2016

Imported From: https://issues.scala-lang.org/browse/SI-9823?orig=1
Reporter: Mulugeta Mammo (mulugeta)
Affected Versions: 2.9.0, 2.10.0, 2.11.0, 2.12.1

@scabug
Copy link
Author

scabug commented Jun 18, 2016

@retronym said:
Thank you for the amazingly detailed analysis! We'll digest this and try to reproduce your results and report back.

@scabug
Copy link
Author

scabug commented Jun 18, 2016

@retronym said (edited on Jun 18, 2016 8:21:45 AM UTC):
Here's another possible change. Not benchmarked as of yet.

diff --git a/src/library/scala/collection/IndexedSeqLike.scala b/src/library/scala/collection/IndexedSeqLike.scala
index f4bf58f..a9445a4 100644
--- a/src/library/scala/collection/IndexedSeqLike.scala
+++ b/src/library/scala/collection/IndexedSeqLike.scala
@@ -8,6 +8,7 @@
 
 package scala
 package collection
+import scala.collection.generic.CanBuildFrom
 
 /** A template trait for indexed sequences of type `IndexedSeq[A]`.
  *
@@ -86,6 +87,19 @@ trait IndexedSeqLike[+A, +Repr] extends Any with SeqLike[A, Repr] {
   override /*IterableLike*/
   def iterator: Iterator[A] = new Elements(0, length)
 
+
+  override def map[B, That](f: (A) => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
+    def builder = { // extracted to keep method size under 35 bytes, so that it can be JIT-inlined
+      val b = bf(repr)
+      b.sizeHint(size)
+      b
+    }
+    val b = builder
+    for (x <- this) b += f(x)
+    b.result
+  }
+
+
   /* Overridden for efficiency */
   override def toBuffer[A1 >: A]: mutable.Buffer[A1] = {
     val result = new mutable.ArrayBuffer[A1](size)
diff --git a/src/library/scala/collection/TraversableLike.scala b/src/library/scala/collection/TraversableLike.scala
index d914f2e..8449e87 100644
--- a/src/library/scala/collection/TraversableLike.scala
+++ b/src/library/scala/collection/TraversableLike.scala
@@ -225,12 +225,7 @@ trait TraversableLike[+A, +Repr] extends Any
     (that ++ seq)(breakOut)
 
   def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
-    def builder = { // extracted to keep method size under 35 bytes, so that it can be JIT-inlined
-      val b = bf(repr)
-      b.sizeHint(this)
-      b
-    }
-    val b = builder
+    val b = bf(repr)
     for (x <- this) b += f(x)
     b.result
   }

@scabug
Copy link
Author

scabug commented Jun 24, 2016

Mulugeta Mammo (mulugeta) said:
Yes that works as well. For the above example run it would get the CPI down to 0.56 from 9.97 for the original implementation.

@scabug
Copy link
Author

scabug commented Jul 27, 2016

Mulugeta Mammo (mulugeta) said (edited on Jul 27, 2016 3:19:39 PM UTC):
For the sample example, use the -XX:-TieredCompilation flag to avoid C1 and use C2 instead. In the genome pipeline case, the flag is enabled but Hotspot compiles a similar piece of code at C2/Level 4. Also, this is not due to code cache exhaustion.

@scabug
Copy link
Author

scabug commented Aug 19, 2016

@retronym said:
I'm trying to get a fix for this into 2.12.0: scala/scala#5351

@scabug
Copy link
Author

scabug commented Aug 30, 2016

@adriaanm said:
scala/scala#5364

@scabug scabug closed this as completed Aug 30, 2016
@scabug
Copy link
Author

scabug commented Sep 1, 2016

Mulugeta Mammo (mulugeta) said:
So that was basically preferring polymorphism over instanceof (resolution 2). Thanks.

@retronym
Copy link
Member

The underlying performance problem in the JVM is tracked as JDK-8180450

@He-Pin
Copy link
Contributor

He-Pin commented Sep 9, 2022

optimization in netty:netty/netty#12709

@franz1981
Copy link

franz1981 commented Sep 16, 2022

@He-Pin Another one: netty/netty#12806 more stealthy but could impact many other use cases

This could impact Cassandra as well, mentioned in the comments of the JDK issue, but cannot find any other reference really

@franz1981
Copy link

franz1981 commented Oct 12, 2022

@scabug @retronym

FYI I have built a tool to help diagnosis this issue:
https://github.com/franz1981/type-pollution-agent

Although it's just at bytecode level it guide the developers to spot code pattern and usage that can make the issue to happen (assuming inlining or other JIT optimizations on TypeProfile re monomorphism won't constant fold the instanceof/checkcast ops).
The idea is to use such tool to spot patterns that looks legit and resurrect the JDK issue (and get it fixed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants