Performance degradation in Akka with 2.12 #10083

Closed
scabug opened this issue Nov 30, 2016 · 20 comments

scabug commented Nov 30, 2016

A simple ping-pong throughput test of Akka yields 10-40% lower throughput with Scala 2.12 than with 2.11. Reproducer project: https://github.com/johanandren/akka-ping-pong. Use a forked `run` from inside sbt; `++2.12.0` followed by `run` runs the benchmark on 2.12.
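The reproducer boils down to roughly this shape (a sketch, not the exact project code; the actor names and message count here are illustrative):

```scala
import akka.actor.{Actor, ActorSystem, Props}

case object Ping
case object Pong

// Counts down one Pong per round trip, then shuts the system down
class Pinger(var remaining: Long) extends Actor {
  def receive = {
    case Pong =>
      remaining -= 1
      if (remaining == 0) context.system.terminate()
      else sender() ! Ping
  }
}

// Replies Pong to every Ping
class Ponger extends Actor {
  def receive = { case Ping => sender() ! Pong }
}

object Bench extends App {
  val system = ActorSystem("bench")
  val ponger = system.actorOf(Props[Ponger], "ponger")
  val pinger = system.actorOf(Props(new Pinger(10000000L)), "pinger")
  // Kick off the loop: ponger will reply Pong to pinger
  ponger.tell(Ping, pinger)
}
```

Only two actors ever have work to do, which matters for the dispatcher discussion later in this thread.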

Original Akka ticket: akka/akka#21910

I used YourKit and looked at allocation rates/GC, which are a tiny bit worse but not enough to explain it (I think); nothing obvious from CPU trace profiling either (see attached YourKit snapshots; sorry about the bz2, but I couldn't get them small enough to attach as zip).


scabug commented Nov 30, 2016

Imported From: https://issues.scala-lang.org/browse/SI-10083?orig=1
Reporter: Johan Andrén (johanandren)
Affected Versions: 2.12.0
See #10009
Attachments:


scabug commented Nov 30, 2016

@retronym said:
We recently found a performance problem with groupBy which turned out to be caused by an extra level of indirection introduced by the 2.12.x trait encoding.

I see some "inlining too deep" entries in the -XX:+PrintInlining output under 2.12 for this test:

                              @ 148   akka.actor.ActorCell::checkReceiveTimeout (5 bytes)   inline (hot)
                                @ 1   akka.actor.dungeon.ReceiveTimeout::checkReceiveTimeout$ (5 bytes)   inline (hot)
                                  @ 1   akka.actor.dungeon.ReceiveTimeout::checkReceiveTimeout (183 bytes)   inline (hot)
                                    @ 1   akka.actor.ActorCell::akka$actor$dungeon$ReceiveTimeout$$receiveTimeoutData (5 bytes)   accessor
                                    @ 11   akka.actor.ActorCell::mailbox (5 bytes)   inline (hot)
                                      @ 1   akka.actor.dungeon.Dispatch::mailbox$ (5 bytes)   inline (hot)
                                        @ 1   akka.actor.dungeon.Dispatch::mailbox (14 bytes)   inline (hot)
                                          @ 7   sun.misc.Unsafe::getObjectVolatile (0 bytes)   (intrinsic)
                                    @ 16   akka.dispatch.Mailbox::hasMessages (10 bytes)   inline (hot)
                                      @ 1   akka.dispatch.Mailbox::messageQueue (5 bytes)   accessor
                                      @ 4   akka.dispatch.UnboundedMailbox$MessageQueue::hasMessages (5 bytes)   inline (hot)
                                      @ 4   akka.dispatch.UnboundedMailbox$MessageQueue::hasMessages (5 bytes)   inline (hot)
                                       \-> TypeProfile (8/13031 counts) = akka/event/LoggerMailbox
                                       \-> TypeProfile (13023/13031 counts) = akka/dispatch/UnboundedMailbox$MessageQueue
                                        @ 1   akka.dispatch.QueueBasedMessageQueue::hasMessages$ (5 bytes)   inline (hot)
                                          @ 1   akka.dispatch.QueueBasedMessageQueue::hasMessages (20 bytes)   inline (hot)
                                            @ 1   akka.dispatch.UnboundedMailbox$MessageQueue::queue (2 bytes)   inline (hot)
                                            @ 6   java.util.concurrent.ConcurrentLinkedQueue::isEmpty (13 bytes)   inline (hot)
                                            @ 6   java.util.concurrent.ConcurrentLinkedQueue::isEmpty (13 bytes)   inline (hot)
                                             \-> TypeProfile (8/6698 counts) = akka/event/LoggerMailbox
                                             \-> TypeProfile (6690/6698 counts) = akka/dispatch/UnboundedMailbox$MessageQueue
                                              @ 1   java.util.concurrent.ConcurrentLinkedQueue::first (65 bytes)   inline (hot)
                                                @ 38   java.util.concurrent.ConcurrentLinkedQueue::updateHead (20 bytes)   inline (hot)
                                                  @ 8   java.util.concurrent.ConcurrentLinkedQueue::casHead (13 bytes)   inlining too deep
                                                  @ 16   java.util.concurrent.ConcurrentLinkedQueue$Node::lazySetNext (12 bytes)   inlining too deep
                                              @ 1   java.util.concurrent.ConcurrentLinkedQueue::first (65 bytes)   inline (hot)
                                                @ 38   java.util.concurrent.ConcurrentLinkedQueue::updateHead (20 bytes)   inline (hot)
                                                  @ 8   java.util.concurrent.ConcurrentLinkedQueue::casHead (13 bytes)   inlining too deep
                                                  @ 16   java.util.concurrent.ConcurrentLinkedQueue$Node::lazySetNext (12 bytes)   inlining too deep
                                        @ 1   akka.dispatch.QueueBasedMessageQueue::hasMessages$ (5 bytes)   inline (hot)
                                          @ 1   akka.dispatch.QueueBasedMessageQueue::hasMessages (20 bytes)   inline (hot)
                                            @ 1   akka.dispatch.UnboundedMailbox$MessageQueue::queue (2 bytes)   inline (hot)
                                            @ 6   java.util.concurrent.ConcurrentLinkedQueue::isEmpty (13 bytes)   inline (hot)
                                            @ 6   java.util.concurrent.ConcurrentLinkedQueue::isEmpty (13 bytes)   inline (hot)
                                             \-> TypeProfile (8/6698 counts) = akka/event/LoggerMailbox
                                             \-> TypeProfile (6690/6698 counts) = akka/dispatch/UnboundedMailbox$MessageQueue
                                              @ 1   java.util.concurrent.ConcurrentLinkedQueue::first (65 bytes)   inline (hot)
                                                @ 38   java.util.concurrent.ConcurrentLinkedQueue::updateHead (20 bytes)   inline (hot)
                                                  @ 8   java.util.concurrent.ConcurrentLinkedQueue::casHead (13 bytes)   inlining too deep
                                                  @ 16   java.util.concurrent.ConcurrentLinkedQueue$Node::lazySetNext (12 bytes)   inlining too deep
                                              @ 1   java.util.concurrent.ConcurrentLinkedQueue::first (65 bytes)   inline (hot)
                                                @ 38   java.util.concurrent.ConcurrentLinkedQueue::updateHead (20 bytes)   inline (hot)
                                                  @ 8   java.util.concurrent.ConcurrentLinkedQueue::casHead (13 bytes)   inlining too deep
                                                  @ 16   java.util.concurrent.ConcurrentLinkedQueue$Node::lazySetNext (12 bytes)   inlining too deep
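For anyone reproducing this: output like the above comes from HotSpot's diagnostic flags, which can be passed to the forked benchmark JVM, e.g. from build.sbt (a sketch; the flags themselves are standard HotSpot options):

```scala
// build.sbt sketch: fork the benchmark JVM and enable inlining diagnostics
fork := true
javaOptions ++= Seq(
  "-XX:+UnlockDiagnosticVMOptions", // required before PrintInlining is accepted
  "-XX:+PrintInlining"              // log every inlining decision C2 makes
)
```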


scabug commented Nov 30, 2016

@retronym said:
See: scala/scala#5528 / #10009.

In that case, we refactored the hot code to avoid depending on JVM inlining so heavily.

However, in your case I'm not able to improve performance with -XX:MaxInlineLevel set to 12 or 15, so you might have a different problem altogether.
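For reference, that experiment raises HotSpot's maximum inline depth (default 9) on the forked JVM, e.g.:

```scala
// build.sbt sketch: raise the maximum inline depth beyond the default of 9,
// so chains like the ForkJoin calls above are no longer cut off
javaOptions += "-XX:MaxInlineLevel=15"
```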


scabug commented Nov 30, 2016

@retronym said:
I wonder if we could exhibit the same slowdown with fewer (even 1) threads in play. That might help us focus in on the problem.

Is there some sort of simpler/smaller dispatcher we could configure?


scabug commented Nov 30, 2016

Johan Andrén (johanandren) said:
I'll try it with fewer threads in the dispatcher.


scabug commented Nov 30, 2016

Johan Andrén (johanandren) said:
Pushed sample project with a single threaded thread pool based dispatcher.
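Such a dispatcher can be configured along these lines (a sketch using standard Akka dispatcher config keys; the exact configuration used is in the sample project):

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Sketch: a thread-pool-executor dispatcher pinned to a single thread
val config = ConfigFactory.parseString("""
  single-thread-dispatcher {
    type = Dispatcher
    executor = "thread-pool-executor"
    thread-pool-executor {
      core-pool-size-min = 1
      core-pool-size-max = 1
    }
  }
""")
val system = ActorSystem("bench", config.withFallback(ConfigFactory.load()))
// Actors then opt in via Props(...).withDispatcher("single-thread-dispatcher")
```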


scabug commented Nov 30, 2016

Johan Andrén (johanandren) said:
After building an Akka artifact for 2.12 with -Xmixin-force-forwarders:true, a single-threaded dispatcher seemed to be on par between 2.11 and 2.12, but going back to the default dispatcher with more threads, 2.12 still lands at ~45% slower.
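The flag in question is a scalac option, set when building the Akka artifact, e.g.:

```scala
// build.sbt sketch: force the compiler to emit forwarder methods in classes
// that mix in trait members (relevant to the trait-encoding indirection above)
scalacOptions += "-Xmixin-force-forwarders:true"
```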


scabug commented Nov 30, 2016

@SethTisue said:
You can test it with a Scala nightly from the 2.12.x branch as follows: http://stackoverflow.com/a/40622879/86485


scabug commented Nov 30, 2016

@lrytz said:
-Xmixin-force-forwarders:true is the default on 2.12.0.


scabug commented Dec 1, 2016

Johan Andrén (johanandren) said:
Tried building/running with 2.12.1-c2eb299-nightly, no improvement.


scabug commented Dec 1, 2016

@retronym said (edited on Dec 1, 2016 11:13:03 AM UTC):
Am I correct in saying there are no external Scala dependencies needed to build the core of Akka and run this benchmark? If so, we could try building the core of Akka and this benchmark across 2.12.0-M1, M2, etc. to see when things slowed down. Maybe we'll even be able to bisect to a particular commit.

One problem might be if milestones of Scala 2.12 had bugs that prevented Akka from building/working.


scabug commented Dec 2, 2016

Johan Andrén (johanandren) said (edited on Dec 2, 2016 12:07:27 PM UTC):
Yes, the only dependency is typesafe-config, but that's Java-only. I'll try that out next (likely Monday).


scabug commented Dec 5, 2016

@retronym said (edited on Dec 5, 2016 1:34:52 AM UTC):
I've embedded the Akka sources in the benchmark in johanandren/akka-ping-pong@master...retronym:embed-akka


scabug commented Dec 5, 2016

@retronym said (edited on Dec 5, 2016 3:33:51 AM UTC):
I bisected the regression (or maybe the dominant regression) to our removal of scala.concurrent.forkjoin in favour of direct use of the JDK implementation.

scala/scala@fa8012d

In johanandren/akka-ping-pong@master...retronym:embed-akka, I changed the Akka dispatcher and mailbox code to use j.u.c.ForkJoin directly. This also shows the regression with 2.11.8.
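The substance of that change, on Akka's side, is roughly this import swap (a sketch; the full diff is in the branch above):

```scala
// Before (Scala 2.11 and earlier): Scala's bundled fork of JSR166
// import scala.concurrent.forkjoin.{ForkJoinPool, ForkJoinTask, ForkJoinWorkerThread}

// After (Scala 2.12 / the embed-akka branch): the JDK's own implementation
import java.util.concurrent.{ForkJoinPool, ForkJoinTask, ForkJoinWorkerThread}
```

Everything else stays the same, which is what makes the two ForkJoin implementations directly comparable on 2.11.8.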

Prior to removing it, our fork of ForkJoin was at:

scala/scala@77437ff

Updates ForkJoinPool and dependent classes to the latest jsr166y revisions:

    ForkJoinPool.java:
    Revision 1.185
    Sat Feb 16 20:50:29 2013 UTC (2 months, 2 weeks ago) by jsr166

    ForkJoinTask.java:
    Revision 1.100
    Tue Feb 5 17:09:54 2013 UTC (3 months ago) by jsr166

    ForkJoinWorkerThread.java:
    Revision 1.73
    Wed Nov 21 19:54:39 2012 UTC (5 months, 2 weeks ago) by dl


scabug commented Dec 5, 2016

@retronym said:
Once we figure out the culprit, we should check whether Scala futures are affected, too.


scabug commented Dec 5, 2016

Johan Andrén (johanandren) said:
Not sure if this is useful, but a user in the Akka community wanted to do the same in a patch version of Akka, and ended up with the observation that Java 8 and Scala 2.11 include different versions of JSR166; we then closed the ticket:
akka/akka#21101 (comment)


scabug commented Dec 5, 2016

Johan Andrén (johanandren) said:
It seems the revision of the JSR166 ForkJoinPool in 2.11 is 1.185, while for Java 8 I can't quite figure out which revision it includes; comparing the one on the 2.11 branch with the jdk8u branch, the versions are very different (http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/8b04ee324a1a/src/share/classes/java/util/concurrent/ForkJoinPool.java)


scabug commented Dec 5, 2016

@retronym said:
Here's the revision log of ForkJoinPool in the JDK8 updates branch:

http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/log/8b04ee324a1a/src/share/classes/java/util/concurrent/ForkJoinPool.java

I think the last commit, discussed at https://bugs.openjdk.java.net/browse/JDK-8080623, might be relevant. As far as I can tell the benchmark only has two threads' worth of work to do (one ping and one pong actor), so the 100% CPU utilization on machines with >2 cores was busy waiting. Throughput was certainly improved with this tactic, but:


Unfortunately the spinning has shown to be fairly CPU intensive, causing overall performance regressions in applications due to less CPU available for the other parts of the application not running as FJ tasks.

The recommendation which has been discussed with Doug Lea is to disable the spinning for JDK 8, and for JDK 9 continue the ongoing work and research to solve the active waiting without spinning. Depending on the timing of this work it may potentially be backported to a later release of 8u.
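One way to see why the spinning matters here: the benchmark only ever has two runnable tasks, so any pool sized to the machine leaves idle workers. A sketch of constraining the pool to the real amount of work (parallelism value chosen for this benchmark, not a general recommendation):

```scala
import java.util.concurrent.ForkJoinPool

// Sketch: only two actors (ping and pong) ever have work, so on a >2-core
// machine a default-sized pool has idle workers; in the pre-JDK-8080623
// ForkJoinPool those workers busy-waited, burning CPU without adding throughput
val pool = new ForkJoinPool(2) // cap parallelism at the actual amount of work
```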


scabug commented Dec 27, 2016

@retronym said:
Here's a related JEP that might let a future version of the standard ForkJoin pool re-enable busy waiting: http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-October/035613.html


scabug commented Apr 4, 2017

@SethTisue said:
This has been addressed on Akka's end.
