Major performance bottleneck in scala.collection.mutable.Builder #9823
Comments
Imported From: https://issues.scala-lang.org/browse/SI-9823?orig=1 |
@retronym said: |
@retronym said (edited on Jun 18, 2016 8:21:45 AM UTC):
|
Mulugeta Mammo (mulugeta) said: |
Mulugeta Mammo (mulugeta) said (edited on Jul 27, 2016 3:19:39 PM UTC): |
@retronym said: |
@adriaanm said: |
Mulugeta Mammo (mulugeta) said: |
The underlying performance problem in the JVM is tracked as JDK-8180450 |
A related optimization in netty: netty/netty#12709 |
@He-Pin Another one: netty/netty#12806. It is more stealthy but could impact many other use cases. This could impact Cassandra as well, as mentioned in the comments of the JDK issue, but I cannot find any other reference. |
FYI, I have built a tool to help diagnose this issue: Although it works only at the bytecode level, it guides developers to the code patterns and usages that can trigger the issue (assuming that inlining or other JIT optimizations based on TypeProfile, such as those exploiting monomorphism, do not constant-fold the instanceof/checkcast ops). |
Background:
While profiling the performance of a genome sequencing pipeline on a Spark cluster, we noticed that the hot methods for the pipeline came from the Scala collection APIs; specifically, the `sizeHint` methods in the `scala.collection.mutable.Builder` trait.
Further analysis of the JIT-compiled code shows that a considerable number of CPU cycles are spent on the instanceof check, which introduces a bottleneck. The problem is worse for a single executor with many threads than for multiple executors with fewer threads each. For example, running the critical stage of our pipeline with 4 executors (9 cores each) is 3x faster than running it with 1 executor (with 36 cores). It is also worth mentioning that the problem gets worse on multi-socket systems with many cores.
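As a rough illustration of the kind of check described above, here is a simplified, self-contained model of a builder that forwards a size hint only when a runtime type test says the collection is indexed. `SizeHintPattern` and `DemoBuilder` are hypothetical names made up for this sketch; this is not the library's source, only the shape of the pattern.

```scala
object SizeHintPattern {
  // Hypothetical stand-in for a builder; NOT scala.collection.mutable.Builder.
  final class DemoBuilder {
    // Records the last hint received, so the effect of the check is observable.
    var hinted: Option[Int] = None

    def sizeHint(size: Int): Unit = { hinted = Some(size) }

    // Forward a hint only when `size` is assumed cheap, decided by a runtime
    // type test against an interface. This test is the hot spot in question.
    def sizeHint(coll: Iterable[Any]): Unit = {
      if (coll.isInstanceOf[scala.collection.IndexedSeq[_]]) sizeHint(coll.size)
    }
  }
}
```

With this model, passing a `Vector` records a hint while passing a `List` does not, because only the former passes the `isInstanceOf` test.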
Root cause:
The actual root cause of the problem is a slowdown in Java’s `instanceof` operator. More specifically, it is thrashing of the one-element secondary supertype cache that HotSpot uses to cache the last observed secondary supertype (secondary supertypes are interfaces, and classes with a class hierarchy depth greater than 7).
Resolution-1:
Since the root cause of the problem is HotSpot’s implementation of the secondary supertype check, a permanent solution would require modifying the one-element caching. In our case, we took a rather heavy-handed approach: we disabled this one-element cache and let the secondary subtype check rely exclusively on scanning _secondary_supers. This hack, which does not require any modification to Scala, removed the `coll.isInstanceOf[collection.IndexedSeqLike[_,_]]` bottleneck and sped up the critical stage of our pipeline by a factor of 3 (for a Spark configuration with 1 executor and 36 cores, resulting in the same performance we previously got by launching multiple executors with fewer threads each).
This resolution requires modifying HotSpot source files and would take more time to upstream to the JDK.
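To make the cache behaviour concrete, the following is a toy model, not HotSpot code. It assumes only what is described above: a per-class list of secondary supertypes with a one-element cache in front of it. `SecondarySuperCacheModel`, `KlassModel` and the string type names are hypothetical. The model merely counts cache writes; on real hardware each such write to shared class metadata is what causes contention between cores.

```scala
object SecondarySuperCacheModel {
  // Toy stand-in for a class's metadata: its secondary supertypes plus a
  // one-element cache holding the last supertype that matched.
  final class KlassModel(secondarySupers: Array[String]) {
    private var cache: Option[String] = None
    var cacheWrites = 0 // writes to the one-element cache
    var scans = 0       // linear scans of the secondary supertypes

    // Subtype check with the one-element cache in front of the scan.
    def isSubtypeCached(target: String): Boolean =
      if (cache.contains(target)) true
      else if (scan(target)) { cache = Some(target); cacheWrites += 1; true }
      else false

    // Subtype check that always scans and never writes (the Resolution-1 shape).
    def isSubtypeScanOnly(target: String): Boolean = scan(target)

    private def scan(target: String): Boolean = {
      scans += 1
      secondarySupers.contains(target)
    }
  }

  // Alternates between two targets, the pattern that defeats a one-element cache.
  def alternate(k: KlassModel, n: Int, cached: Boolean): Unit = {
    var i = 0
    while (i < n) {
      val target = if (i % 2 == 0) "IndexedSeqLike" else "Serializable"
      if (cached) k.isSubtypeCached(target) else k.isSubtypeScanOnly(target)
      i += 1
    }
  }
}
```

In this model, alternating between two targets makes the cached variant miss and rewrite the cache on every check, so it scans just as often as the scan-only variant while also writing each time. The scan-only variant never writes.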
Resolution-2:
The second resolution is to avoid the instanceof check in the `sizeHint` methods altogether and use polymorphism instead. The simplest change we tried was to comment out the instanceof check and let every collection, even collections with an “expensive” size method, use the hint. This removed the bottleneck and sped up the critical stage of our pipeline by a factor of 2.8.
The other option we tried, which actually uses polymorphism, is to override the `sizeHint` methods in collections of type `IndexedSeqLike` and avoid the instanceof check (and a call to the `Builder.sizeHint` methods). This also resulted in a 2.8x performance gain. There may be other, more elegant solutions.
While browsing the Scala source, we observed other places that use the same code pattern (if_instance_of_this_do_this_else_do_this), which may end up having the same problem. We recommend reviewing those as well.
How to repeat the behavior:
The problem is not limited to our genome workload. For example, if you test out the example code below with multiple threads, it will bottleneck on the same instanceof check at `scala.collection.mutable.Builder`. In our sample run with 36 threads, applying either one of the fixes gave a ~15x speedup in execution time (compared to the default).
If you need any additional information or further details, feel free to contact me [mailto:mulugeta.mammo@intel.com]. Also, let me know if you want me to do a PR on this.
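A minimal sketch of a multi-threaded test of this general kind follows. It is a hypothetical illustration and not the example code from this report. `InstanceOfStress`, `countChecks` and `run` are made-up names. The sketch assumes that alternating type tests against two different interfaces on the same objects is enough to exercise the secondary supertype check. Whether a slowdown is actually visible depends on the JVM version and on whether the JIT folds the checks away, so the timing is returned but nothing is claimed about it.

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

object InstanceOfStress {
  // Tests every item against two different interfaces in turn and counts the hits.
  // Alternating the target type for the same runtime class is the access pattern
  // that can force a one-element cache to be rewritten on every check.
  def countChecks(items: Array[AnyRef], rounds: Int): Long = {
    var hits = 0L
    var r = 0
    while (r < rounds) {
      var i = 0
      while (i < items.length) {
        val x = items(i)
        if (x.isInstanceOf[scala.collection.IndexedSeq[_]]) hits += 1
        if (x.isInstanceOf[java.io.Serializable]) hits += 1
        i += 1
      }
      r += 1
    }
    hits
  }

  // Runs the same workload on `threads` threads; returns (total hits, elapsed ms).
  def run(threads: Int, rounds: Int): (Long, Long) = {
    val items: Array[AnyRef] = Array(Vector(1, 2, 3), Vector("a"), Vector.empty[Int])
    val pool = Executors.newFixedThreadPool(threads)
    val start = System.nanoTime()
    try {
      val futures = (1 to threads).map { _ =>
        pool.submit(new Callable[Long] { def call(): Long = countChecks(items, rounds) })
      }
      val total = futures.map(_.get()).sum
      (total, TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start))
    } finally pool.shutdown()
  }
}
```

Comparing the elapsed time of, say, `InstanceOfStress.run(1, n)` against `InstanceOfStress.run(36, n)` on a many-core machine gives a first impression of how the checks scale with thread count. A proper measurement would need warm-up and a benchmarking harness.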
Thanks,
Mulugeta