
mutable LinkedHashMap/LinkedHashSet nodes' link fields should be nulled out when removed #8774

Closed
scabug opened this issue Aug 2, 2014 · 13 comments


scabug commented Aug 2, 2014

Several collection types have var link fields in their nodes; examples are LinkedHashMap and LinkedHashSet. As implemented, removal of a value from one of these structures unlinks the node from the collection but leaves that node's link fields unchanged. This can lead to a form of nepotism that causes unnecessary promotion of objects in generational garbage collection, which in turn can lead to sequences of frequent full garbage collections.

How does this happen? These link fields allow older (previously allocated) objects to point to younger objects in the heap. If a removed node happens to be in an older generation, the fact that it refers to younger objects causes those objects to be promoted even if they are otherwise unreachable. This then causes an instance of one of these collections (especially one used as a queue) to repeatedly promote objects that should have been collected in the young generation. The problem is not a semantic one; rather, it is a limitation of incremental tracing garbage collection, which collects part of the heap while conservatively treating another part as if it were reachable.
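The hazard can be sketched with a hypothetical doubly-linked node type (a simplification for illustration, not the actual LinkedEntry implementation):

```scala
// Hypothetical node type sketching the hazard described above.
final class Node(val key: Int) {
  var earlier: Node = null
  var later: Node = null
}

// Unlink n from its neighbors the way the current collections do:
// n.earlier and n.later still point into the live list, so a tenured
// removed node keeps its (possibly young) neighbors reachable.
def removeLeaky(n: Node): Unit = {
  if (n.earlier != null) n.earlier.later = n.later
  if (n.later != null) n.later.earlier = n.earlier
}

// The proposed fix: also clear the removed node's own link fields so it
// cannot keep otherwise-dead young objects alive.
def removeClean(n: Node): Unit = {
  removeLeaky(n)
  n.earlier = null
  n.later = null
}
```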

Below is a simple benchmark (LinkedHashMapTest.scala) that shows the issue. The benchmark has three phases. In the first phase, a LinkedHashMap[Int, Int] is populated. In the second phase, that collection instance (and all its nodes) is promoted by invoking System.gc(). In the final phase, nodes are repeatedly added and removed as if the data structure were a queue.
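A minimal sketch of such a three-phase, queue-like workload (sizes and names are illustrative, not the attached LinkedHashMapTest.scala):

```scala
import scala.collection.mutable.LinkedHashMap

val map = LinkedHashMap.empty[Int, Int]

// Phase 1: populate the map.
for (i <- 0 until 10000) map(i) = i

// Phase 2: promote the map and all of its nodes with a full collection.
System.gc()

// Phase 3: drive the map like a queue. Each step removes the oldest key
// and inserts a fresh one, so an old-generation removed node ends up
// pointing at newly allocated nodes unless its link fields are cleared.
var next = 10000
for (_ <- 0 until 50000) {
  map -= (next - 10000)
  map(next) = next
  next += 1
}
```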

In my tests, I ran the benchmark on a build of the top-of-tree for the scala-lang repo on OpenJDK 1.7. For options, I had:
export JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+UseParallelGC"
The command to run it was:
test/partest ~/tests/linkedhashmap/LinkedHashMapTest.scala
With an unmodified tree, one gets behavior like what is reported in the following attached logs:
run-bad.log: output from partest itself
LinkedHashMapTest-linkedhashmap.log-bad: the log in the test directory generated by partest
One can see that in phase 3, the test does repeated full GCs as the nodes of the LinkedHashMap are all promoted.

The fix is relatively straightforward: add code in the remove operations to null out the link fields. My diffs for the top-of-tree can be seen in the attached file, gif-diffs.txt. With a tree modified with these diffs, one gets behavior like what is reported in these attached logs when running the same above test:
run-good.log: output from partest
LinkedHashMapTest-linkedhashmap.log-good: the log in the test directory generated by partest
As can be seen, there are now only young GCs when the test is run with the modified tree.

Separately, I have run the full test suite that comes with the repo and all tests pass.

Aside: This all came out of observing occasional, unpredictable spikes in GC activity in some of our services. We found the same issue in the Java LinkedHashMap class and patched our 1.7 JDK to null out these link fields; we've verified that this fixes the pathologies we saw. We also found that, independently, the OpenJDK 1.8 class files for these collection classes have been modified (by Oracle) to null out these link fields. We'd now like to do the same for the Scala types.


scabug commented Aug 2, 2014

Imported From: https://issues.scala-lang.org/browse/SI-8774?orig=1
Reporter: Alex Garthwaite (alextg)
Affected Versions: 2.10.4, 2.11.1
Attachments:


scabug commented Aug 2, 2014

Alex Garthwaite (alextg) said:
I am testing the above patch against the 2.10.x branch and will post a review-request tomorrow.


scabug commented Aug 5, 2014

Alex Garthwaite (alextg) said:
The pull request is here: scala/scala#3911

At the moment, only one set of tests is failing: pr-scala-integrate-ide

However, others are also reporting issues with that test suite. Looking at scala-internal, Paolo Giarrusso is reporting a similar issue with that test today. See https://groups.google.com/forum/#!searchin/scala-internals/pr-scala-integrate-ide/scala-internals/XKJa5WffDXg/LuRgQ4KPSsoJ


scabug commented Aug 7, 2014

Alex Garthwaite (alextg) said:
I'm redoing the pull-request.

Jenkins testing nondeterministically failed on the run/private-inline test. Running it locally shows that this happened 26% of the time. Through selective application of the changes in the request, I narrowed the problem down to the nulling of the next field when nodes in a HashMap are removed. This is not due to incidental GC effects (eliminated as a possibility by running with a large heap and -verbose:gc). Instead, it is due to the fact that the next field is also used for iteration/foreach operations.

My next step is to write a test that tests for this and is deterministic. After that, I will simplify the patch to the Linked collection types and rerun the tests (looping) overnight. I plan to put out the new pull-request tomorrow.

Thanks to Jason Zaugg for getting pr-scala-integrate-ide to work again!


scabug commented Aug 15, 2014

Alex Garthwaite (alextg) said:
I got sidetracked but finally created a single commit that just addresses the GC issue in LinkedHashMap and LinkedHashSet (since these, with their doubly-linked lists, exhibit the problem most clearly). The resulting patch passes all tests.

I also tried to write a test for HashMap that would exhibit the problem deterministically. However, it ends up needing one thread doing the remove operations and another using an iterator. That ends up being just as racy as the existing private-inline.scala test. And so I dropped the test.

So the good news is that patch is ready, tested, and available in a pull request.

But all of this made me concerned. Currently, mutable.LinkedHashMap has the property that if one thread starts iterating over a map (without otherwise synchronizing with removers), it will visit all nodes not removed during the course of the iteration. With my proposed change, an unsynchronized iterator may land on a node just as it is being removed, find its "later" field nulled out, and so end its iteration earlier than it otherwise would. Is this acceptable? Is there a definition of the semantics of unsynchronized iteration over a mutable collection for Scala?
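The concern can be made concrete with a hypothetical singly-linked entry type (a simplification of the collection's nodes, simulated sequentially rather than with two threads): if a remover nulls the link of the node a walker is sitting on, the walk ends early.

```scala
// Hypothetical entry type for illustration.
final class Entry(val value: Int) {
  var later: Entry = null
}

// Walk the chain from `start`, collecting values until the link runs out.
def walk(start: Entry): List[Int] = {
  var cur = start
  val buf = List.newBuilder[Int]
  while (cur != null) {
    buf += cur.value
    cur = cur.later
  }
  buf.result()
}
```

Nulling a mid-chain link between two walks stands in for the remover racing the iterator: the second walk silently misses the tail.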

The Java 8 JDK, which made a similar change to the LinkedHashMap class, likely does so assuming that unsynchronized iteration is racy. But Scala may be attempting to preserve some notion of persistent-structure behavior even for the mutable collections.


scabug commented Aug 18, 2014

Alex Garthwaite (alextg) said:
Could someone from collections look at my last post and respond? Thanks!


scabug commented Aug 18, 2014

@som-snytt said:
I'm not sure the non-determinism in the private-inline test is due to a race condition. It could be due just to traversal order.

I remember talk about writing class files multi-threaded, but isn't everything else single-threaded? modulo GC.


scabug commented Aug 19, 2014

Alex Garthwaite (alextg) said:
You are correct: the issue I hit in scalac had nothing to do with concurrency. @Ichoran had the right insight.

I just posted the following response on the pull-request:

@Ichoran: you identified the problem that causes private-inline.scala to fail. It isn't a case of concurrency; it is a case of a remove() (or -=) operation occurring in the closure passed to a foreach (or iterator). I added asserts to the code and got the following backtrace:

Possible compiler crash during test of: test/files/run/private-inline.scala
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:165)
	at scala.collection.mutable.HashMap.$minus$eq(HashMap.scala:97)
	at scala.tools.nsc.backend.jvm.GenASM$AsmPhase$$anonfun$run$3.apply(GenASM.scala:90)
	at scala.tools.nsc.backend.jvm.GenASM$AsmPhase$$anonfun$run$3.apply(GenASM.scala:87)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:101)
	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:101)
	at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:234)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
	at scala.collection.mutable.HashMap.foreach(HashMap.scala:101)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
	at scala.tools.nsc.backend.jvm.GenASM$AsmPhase.run(GenASM.scala:87)
	at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1583)
	at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1557)
	at scala.tools.nsc.Global$Run.compileSources(Global.scala:1553)
	at scala.tools.nsc.Global$Run.compile(Global.scala:1662)

When I rewrote the foreachEntry method from (simplified):

         while (es != null) {
           f(...)
           es = es.next
         }

to be:

         while (es != null) {
           val next1 = es.next   // capture the successor before f runs
           f(...)
           val next = es.next    // re-read: f may have removed es and nulled its link
           es = if (next ne null) next else next1
         }

the private-inline.scala test passes 100% of the time for 100 runs.

The challenge is that one then needs to be aware that removing a node changes its link fields, and to write hasNext/next loops that similarly read the next value before executing the closure. This doesn't seem acceptable (and shows that hasNext/next is not a nice interface in some ways).

The fixed code is also still open to the concurrent issue I mentioned. But it is good to understand why this broke scalac.
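The hasNext/next shape that survives such removals can be sketched with a hypothetical Cell type (not the actual collection iterator): capture the successor before handing the element to the caller.

```scala
// Hypothetical singly-linked cell for illustration.
final class Cell(val value: Int) {
  var next: Cell = null
}

// Iterator that reads the successor *before* returning the element, so
// a remove() in the loop body that nulls the visited cell's link cannot
// strand the iteration.
def iterate(start: Cell): Iterator[Int] = new Iterator[Int] {
  private var cur = start
  def hasNext: Boolean = cur != null
  def next(): Int = {
    val c = cur
    cur = c.next // capture the successor first
    c.value
  }
}
```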


scabug commented Mar 29, 2015

@Ichoran said:
Note to self--check if this ever actually got fixed?


scabug commented Mar 30, 2015

@Ichoran said:
Nope, never actually got fixed (because it's deceptively complex, I guess).


scabug commented Mar 31, 2015

Alex Garthwaite (alextg) said:
Rex, thanks for taking this on.

Here's a summary of what happened before. Thanks especially to Ichoran, Jason, Lucas, and others for discussion and the fix to GenASM. That discussion can be found at: scala/scala#3911

The changes did fix the memory-leak/nepotism issues we hit with the garbage collector. But they also broke existing code. In the failing case, we found that the GenASM stage in scalac had a foreach loop that removed items from the collection; in doing so, it relied on the link field being left unmodified in the removed node so that the next item could be found. See the discussion of August 19th and October 2nd on the above pull request for the specific example. Out of concern for introducing a breaking change, the collective decision was to target any fix at 2.12.

As part of this, I had looked at the documentation specifying collections, iteration and concurrent modifications. What I found were statements like the following:

A collection in package scala.collection.mutable is known to have some operations that change the collection in place. So dealing with mutable collection means you need to understand which code changes which collection when.

found at http://docs.scala-lang.org/overviews/collections/overview.html. Such a statement basically says either there are no guarantees (since the underlying implementation can change) or that it is acceptable to leak and depend on underlying implementation details. Neither is helpful when one wants to change the behavior of a collection, as we are proposing.

By contrast, Java specifies fail-fast semantics for its collections that severely limit what modifications can happen to a collection during iteration. In particular, it only allows remove on an iterator, not on other forms of iteration over a collection. It defines this as part of the specification both for collection iteration and for the remove() operation itself. It implements this (per collection type) with state that carries between hasNext() and next() (as well as remove()). It coordinates the throwing of ConcurrentModificationException, typically, via counters that are incremented by modifying operations and checked during iteration of any kind.

Note that this issue covers more than just concurrent modifications. It also covers modifications done by a single thread iterating over a collection and modifying it as it goes. By limiting modification to removal of the current element and disallowing other kinds of adds and removes, it tries to guarantee that if one starts iterating over a collection, one will visit all elements of that collection.
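The counter scheme described above can be sketched in Scala (names are illustrative; this is not the java.util implementation):

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch of a fail-fast collection: a counter bumped by every
// structural modification, checked by every iterator operation.
final class FailFastBuffer[A] {
  private val elems = ArrayBuffer.empty[A]
  private var modCount = 0

  def add(a: A): Unit = { elems += a; modCount += 1 }
  def removeAt(i: Int): A = { val a = elems.remove(i); modCount += 1; a }

  // The iterator records the counter at creation and re-checks it on
  // every hasNext/next, throwing if the collection changed behind it.
  def iterator: Iterator[A] = new Iterator[A] {
    private val expected = modCount
    private var i = 0
    private def check(): Unit =
      if (modCount != expected)
        throw new java.util.ConcurrentModificationException
    def hasNext: Boolean = { check(); i < elems.length }
    def next(): A = { check(); val a = elems(i); i += 1; a }
  }
}
```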

Limiting the set of allowed operations, carrying state between hasNext and next operations, and supporting fail-fast semantics add complexity and cost. But they also give clear guidance on expected behavior and don't let dependencies on underlying implementations leak out. That this is a real concern can be seen in statements like the following:
http://stackoverflow.com/questions/26249635/scala-mutablelist-when-foreach-add-element-why-not-throw-exception/26249934#26249934

So, I think the consensus was:

  • fix the memory-leak in 2.12 since it will (likely) break existing code in the field
  • consider whether to introduce explicit fail-fast support in the (mutable) collections, and
  • take this as an opportunity to better specify the semantics of iteration and foreach operations on (mutable) collections so that it is clearer both what the behavior of collections are and to facilitate future improvements in these classes without breaking code in the field.


scabug commented Jul 22, 2016

@szeiger said:
Fix for 2.12: scala/scala#5295
Rescheduling to 2.13 for further consideration.

@scabug scabug added this to the 2.13.0-M1 milestone Apr 7, 2017
@adriaanm adriaanm modified the milestones: 2.13.0-M1, 2.13.0-M2 Apr 14, 2017
@szeiger szeiger modified the milestones: 2.13.0-M3, 2.13.0-M2 Jul 20, 2017
@adriaanm adriaanm modified the milestones: 2.13.0-M3, 2.13.0-M4 Jan 30, 2018

lrytz commented Apr 23, 2018

This was fixed in scala/scala#5295 and we have scala/scala-dev#188 for the follow-up discussion.
