mutable LinkedHashMap/LinkedHashSet nodes' link fields should be nulled out when removed #8774
Comments
Imported From: https://issues.scala-lang.org/browse/SI-8774?orig=1
Alex Garthwaite (alextg) said: At the moment, only one set of tests is failing: pr-scala-integrate-ide. However, others are also reporting issues with that test suite. Looking at scala-internals, Paolo Giarrusso reported a similar issue with that test today. See https://groups.google.com/forum/#!searchin/scala-internals/pr-scala-integrate-ide/scala-internals/XKJa5WffDXg/LuRgQ4KPSsoJ
Alex Garthwaite (alextg) said: Jenkins testing nondeterministically failed on the run/private-inline test. Running it locally shows that this happens 26% of the time. Through selective application of the changes in the pull request, I narrowed the problem down to the nulling of the next field when nodes in a HashMap are removed. This is not due to incidental GC effects (eliminated as a possibility by running with a large heap and -verbose:gc). Instead, it is due to the fact that the next field is also used for iteration/foreach operations. My next step is to write a test that checks for this deterministically. After that, I will simplify the patch to the Linked collection types and rerun the tests (looping) overnight. I plan to put out the new pull request tomorrow. Thanks to Jason Zaugg for getting pr-scala-integrate-ide to work again!
Alex Garthwaite (alextg) said: I also tried to write a test for HashMap that would exhibit the problem deterministically. However, it ends up needing one thread doing the remove operations and another using an iterator. That ends up being just as racy as the existing private-inline.scala test, so I dropped it. The good news is that the patch is ready, tested, and available in a pull request.

But all of this made me concerned. Currently, mutable.LinkedHashMap has the property that if one thread starts iterating over a map (without otherwise synchronizing with removers), it will visit all nodes not removed during the course of the iteration. With my proposed change, an unsynchronized iterator may happen to land on a node as it is being removed and find the "later" field nulled out, ending its iteration earlier than it otherwise would have. Is this acceptable? Is there a definition of the semantics of unsynchronized iteration over a mutable collection for Scala? The Java 8 JDK, which made a similar change to the LinkedHashMap class, likely does so on the assumption that unsynchronized iteration is racy. But Scala may be aiming for some notion of persistent-structure behavior even for the mutable collections.
@som-snytt said: I remember talk about writing class files multi-threaded, but isn't everything else single-threaded? modulo GC. |
Alex Garthwaite (alextg) said: I just posted the following response on the pull request: Ichoran: you identified the problem that causes private-inline.scala to fail. It isn't a case of concurrency. It is a case of a remove() (or -=) operation occurring in the closure of a foreach (or iterator). I added asserts to the code and got the following backtrace:
When I rewrote the foreachEntry method from (simplified):
to be:
the private-inline.scala test passes 100% of the time over 100 runs. The challenge is that one then needs to be aware that removing nodes changes their link fields, and to write hasNext/next loops that similarly read the next value before executing the closure. This doesn't seem acceptable (and shows that hasNext/next is not a nice interface in some ways). The fixed code is also still open to the concurrency issue I mentioned. But it is good to understand why this broke scalac.
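The pattern described here (capture the next link before invoking the closure, so that a remove inside the closure cannot truncate the traversal) can be sketched as follows. This is an illustrative toy structure, not the actual LinkedHashMap internals; all names (`TinyLinkedMap`, `later`, etc.) are made up for the example:

```scala
// Toy singly linked map whose remove nulls out the link field,
// mimicking the proposed change. Illustrative only.
class Entry(val key: Int, var value: Int) {
  var later: Entry = null // link to the next entry in insertion order
}

class TinyLinkedMap {
  var firstEntry: Entry = null
  var lastEntry: Entry = null

  def put(k: Int, v: Int): Unit = {
    val e = new Entry(k, v)
    if (firstEntry == null) firstEntry = e
    else lastEntry.later = e
    lastEntry = e
  }

  def remove(k: Int): Unit = {
    var prev: Entry = null
    var cur = firstEntry
    while (cur != null && cur.key != k) { prev = cur; cur = cur.later }
    if (cur != null) {
      if (prev == null) firstEntry = cur.later else prev.later = cur.later
      if (cur == lastEntry) lastEntry = prev
      cur.later = null // the nulling that avoids GC nepotism
    }
  }

  // Fragile version: reads cur.later AFTER f runs, so if f removes the
  // current entry (nulling its link), iteration stops early.
  def foreachEntryFragile(f: Entry => Unit): Unit = {
    var cur = firstEntry
    while (cur != null) { f(cur); cur = cur.later }
  }

  // Fixed version: capture the next link BEFORE invoking the closure.
  def foreachEntry(f: Entry => Unit): Unit = {
    var cur = firstEntry
    while (cur != null) { val next = cur.later; f(cur); cur = next }
  }
}
```

With the fixed traversal, a closure that removes the current entry still visits every entry; with the fragile one, the traversal stops at the first removed entry.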
Alex Garthwaite (alextg) said: Here's a summary of what happened before. Thanks especially to Ichoran, Jason, Lucas, and others for the discussion and the fix to GenASM. That discussion can be found at scala/scala#3911.

The changes did fix the memory-leak/nepotism issues we hit with the garbage collector, but they also broke existing code. In the failing case, we found that the GenASM stage in scalac had a foreach loop that would remove items from the collection. In doing so, it relied on the link field being left unmodified in the removed node so that the next item could still be found. See the discussion on August 19th and October 2nd on the above pull request for the specific example. Out of concern for introducing a breaking change, the collective decision was to target any fix at 2.12.

As part of this, I had looked at the documentation specifying collections, iteration, and concurrent modification. What I found were statements like the following:
found at http://docs.scala-lang.org/overviews/collections/overview.html. Such a statement basically says either that there are no guarantees (since the underlying implementation can change) or that it is acceptable to leak and depend on underlying implementation details. Neither is helpful when one wants to change the behavior of a collection, as we are proposing.

By contrast, Java specifies fail-fast semantics for its collections that severely limit what modifications can happen to a collection during iteration. In particular, it only allows remove on an iterator, not on other forms of iteration over a collection. It defines this as part of the specification for both collection iteration and for the remove() operation itself. It implements this (per collection type) with state that is carried between hasNext() and next() (as well as remove). It coordinates the throwing of ConcurrentModificationException, typically, via counters that are incremented by modifying operations and checked during iteration of any kind.

Note that this issue covers more than just concurrent modification. It also covers modifications made by a single thread that is iterating over a collection and modifying it as it goes. By limiting modification to removal of the current element and disallowing other kinds of adds and removes, Java tries to guarantee that if one starts iterating over a collection, one will visit all of its elements. Limiting the set of allowed operations, carrying state between hasNext and next, and supporting fail-fast semantics adds complexity and cost, but it also gives clear guidance on expected behavior and doesn't let dependencies on the underlying implementation leak out. That this is a concern can be seen in statements like the following:

So, I think the consensus was:
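The fail-fast mechanism described above (a modification counter incremented by structural changes and checked on every iterator step) can be sketched in Scala. This is a simplified illustration of the JDK technique, not the actual java.util implementation; `FailFastBuffer` and its members are invented for the example:

```scala
import java.util.ConcurrentModificationException
import scala.collection.mutable.ArrayBuffer

// Sketch of fail-fast iteration: every structural modification bumps
// modCount; the iterator snapshots it at creation and throws if the
// collection changes underneath it.
class FailFastBuffer[A] {
  private val elems = ArrayBuffer.empty[A]
  private var modCount = 0

  def add(a: A): Unit = { elems += a; modCount += 1 }
  def removeAt(i: Int): Unit = { elems.remove(i); modCount += 1 }

  def iterator: Iterator[A] = new Iterator[A] {
    private val expectedModCount = modCount
    private var i = 0
    private def check(): Unit =
      if (modCount != expectedModCount)
        throw new ConcurrentModificationException()
    def hasNext: Boolean = { check(); i < elems.length }
    def next(): A = { check(); val a = elems(i); i += 1; a }
  }
}
```

Any add or remove performed after the iterator is created makes the next hasNext/next call throw, rather than silently skipping or revisiting elements.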
@szeiger said: |
This was fixed in scala/scala#5295 and we have scala/scala-dev#188 for the follow-up discussion.
Several collection types have var link fields in their nodes; examples are LinkedHashMap and LinkedHashSet. As implemented, removing a value from one of these structures unlinks the node from the collection but leaves that node's link fields unchanged. This can lead to a form of nepotism, causing unnecessary promotion of objects in generational garbage collection, and can lead to sequences of frequent full garbage collections.
How does this happen? These fields allow older (previously allocated) objects to point to younger objects in the heap. If a removed node happens to be in an older generation, the fact that it refers to younger objects causes those objects to be promoted even if they are otherwise unreachable. This then causes an instance of one of these collections (especially one used as a queue) to repeatedly promote objects that should have been collected in the young generation. The problem is not a semantic one but a limitation of incremental tracing garbage collection, which collects part of the heap while conservatively treating another part as if it were reachable.
Below is a simple benchmark (LinkedHashMapTest.scala) that shows the issue. The benchmark has three phases. In the first phase, a LinkedHashMap[Int, Int] is populated. In the second phase, that collection instance (and all its nodes) is promoted by invoking System.gc(). In the final phase, nodes are repeatedly added and removed as if the data structure were a queue.
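The benchmark file itself is attached rather than inlined; a minimal reconstruction of the three phases described above might look like the following. Object name, sizes, and iteration counts are illustrative, not taken from the attached LinkedHashMapTest.scala:

```scala
import scala.collection.mutable.LinkedHashMap

object LinkedHashMapBench {
  // Returns the final size of the map; every phase-3 iteration removes
  // one entry and adds one, so this should always equal `n`.
  def run(n: Int, iterations: Int): Int = {
    val m = LinkedHashMap.empty[Int, Int]

    // Phase 1: populate the map.
    for (i <- 0 until n) m(i) = i

    // Phase 2: promote the map and all its nodes to the old generation.
    System.gc()

    // Phase 3: churn the map like a queue. If removed nodes keep their
    // link fields, each promoted node drags its younger neighbours into
    // the old generation on every young collection.
    var nextKey = n
    for (_ <- 0 until iterations) {
      val (k, _) = m.head
      m -= k
      m(nextKey) = nextKey
      nextKey += 1
    }
    m.size
  }

  def main(args: Array[String]): Unit =
    println(s"final size = ${run(100000, 1000000)}")
}
```

Run under -verbose:gc, the interesting output is the GC log rather than the program's own output: the unmodified library shows repeated full GCs in phase 3, while the patched one shows only young collections.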
In my tests, I ran the benchmark on a build of the top-of-tree for the scala-lang repo on OpenJDK 1.7. For options, I had:
export JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+UseParallelGC"
The command to run it was:
test/partest ~/tests/linkedhashmap/LinkedHashMapTest.scala
With an unmodified tree, one gets behavior like what is reported in the following attached logs:
run-bad.log: output from partest itself
LinkedHashMapTest-linkedhashmap.log-bad: the log in the test directory generated by partest
One can see that in phase 3, the test does repeated full GCs as the nodes of the LinkedHashMap are all promoted.
The fix is relatively straightforward: add code in the remove operations to null out the link fields. My diffs for the top-of-tree can be seen in the attached file, gif-diffs.txt. With a tree modified with these diffs, one gets behavior like what is reported in the attached logs below when running the same test:
run-good.log: output from partest itself
LinkedHashMapTest-linkedhashmap.log-good: the log in the test directory generated by partest
As can be seen, there are now only young GCs when the test is run with the modified tree.
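The nulling fix itself can be sketched on a bare doubly linked node structure. This is an illustration of the technique, not the actual diff against the Scala library; `Node`/`NodeList` and their members are invented for the example:

```scala
// Toy doubly linked list whose remove nulls out the removed node's
// link fields, as the proposed fix does for LinkedHashMap/LinkedHashSet.
class Node[A](val value: A) {
  var earlier: Node[A] = null
  var later: Node[A] = null
}

class NodeList[A] {
  var first: Node[A] = null
  var last: Node[A] = null

  def append(v: A): Node[A] = {
    val n = new Node(v)
    if (first == null) first = n
    else { last.later = n; n.earlier = last }
    last = n
    n
  }

  def remove(n: Node[A]): Unit = {
    if (n.earlier == null) first = n.later else n.earlier.later = n.later
    if (n.later == null) last = n.earlier else n.later.earlier = n.earlier
    // The fix: without these two lines, a removed node promoted to the
    // old generation keeps its younger neighbours reachable.
    n.earlier = null
    n.later = null
  }
}
```

After remove, the unlinked node holds no references into the list, so a promoted node can no longer cause nepotistic promotion of its former neighbours.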
Separately, I have run the full test suite that comes with the repo and all tests pass.
Aside: This all came out of observing occasional, unpredictable spikes in GC activity in some of our services. We found the same issue in the Java LinkedHashMap class and patched our 1.7 JDK to null out these link fields; we've verified that this fixes the pathologies we saw. We also found that, independently, the OpenJDK 1.8 versions of these collection classes have been modified (by Oracle) to null out these link fields. We'd now like to do the same for the Scala types.