New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
List serialization/deserialization can overflow the stack #6961
Comments
Imported From: https://issues.scala-lang.org/browse/SI-6961?orig=1 |
@retronym said: to scala-internals https://gist.github.com/4512368 Regards, |
Jan Vanek (j3vanek) said: https://gist.github.com/4517945 There was couple of bugs in the first version, as you might have noticed. This version removes the thread-local data (set's to null) after top-level call, so no trace is left. I think it's better so. But if you think it is better to keep the data there to avoid per-top-list-allocation, then it should not be an instance of a scala class (ListSerializationCtrl), but an Array[Boolean] instance, so the hypothetical class-loader leak is prevented. The version sets inLoop = false directly after write/read in finally block, so that when an exception is thrown the data is in correct state. I first thought it is enough to just remove (set to null) the TLS data in the top-level try/finally, but this way is safer; it's possible to imagine serialization-data which could leave the TLS data in wrong state (if one does e.g serialization within serialization). This is paranoid, but I measured the cost of try/finally there (against setting inLoop = false not within finally) and it was not recognizeable, so no reason to not have that. It also supports the 2.9 serialization format data, at least according my testing. I wrote the object using oldWriteObject() and read correctly using the new readObject(). I've moved the ListSerializationCtrl class and thread-local variable out of List object. This is mainly to have it private and not private[scala] because the List.tlSerCtrl would be visible from Java in IDEs. So now ListSerializationStart/End/Ctrl are all private classes close to each other. Finally I did some performance testing. List of size S, serialized/deserialized N times into/from a byte-array: List size | repeated | 2.9.(ms) | this code(ms) | current 2.10.(ms) So this code is actually faster then the current 2.10. code. That was a pleasant surprise. I don't have definitive explanation on that. I think the code is ready, but I didn't yet have time to learn how to do pull request, or what is the next step. |
@retronym said: |
@axel22 said: |
@axel22 said (edited on Jan 15, 2013 4:27:09 PM UTC): A couple of comments:
Am I right? Nevertheless, could you add a few more tests with nested lists, and lists of objects having nested lists and shared lists to ensure that this works?
Overall, looks good. On a side note, I wonder if there is some existing serialization infrastructure, i.e. a method call, we could use instead |
Jan Vanek (j3vanek) said: ad 1) It was so in the first (or maybe also second) version in the fork. I refactored those code pieces from write/readObjectImpl into those ListSerializationCtrl.write/read/InLoop/AsSeen methods to make the code in write/readObjectImpl shorter and more readable and I hoped giving those pieces names would also make it more understandable (than mutate the vars in the write/readObjectImpl directly). When that happened, having the vars private makes clear that only the methods of LSCtrl can mutate them. Doing that I hoped that the analogy of read vs. write is apparent because both write/readObjectImpl and those write/readInLoop, write/readAsSeen are simple enough to see the mirroring. The goal is readability. If you find the "inlined" version more readable, I've absolutely nothing against it. ad 2,3) Yes. The idea is that the list is serialized in the loop, and each sub-list contributes just its head and itself, so to say. So technically the serialization stream consists of multiple of single-head unconnected lists (well they don't have tail in the stream so they are not lists). The loop makes sure all of them are in, and during the deserialization the loop connects them again. This is controlled by the inLoop variable. When the writeObjectImpl sees inLoop = true, it knows someone else manages the loop and that it must serialize its head only, and that it has to set the seen to true, to tell that someone it did so. So the seen is like a response, that the list (not the head) was there and contributed itself and it's head. I don't like the name "seen". I don't know a better name now, but we should find one. The "inLoop" is more critical, it's like a request, and thus the inLoop = true can't escape, that's why I set it to false in finally. For the same reason the "inLoop" is set to false before the head is serialized. Head can be or can contain a list, that one must not enter writeObject with inLoop = true, obviously. So if after calling writeInLoop the ctrl.seen is false, then it was either Nil or already serialized shared sub-list. There is no need to distinguish the two, in either case the loop must stop. ad 4) Yes, I tried to make read a clear inverse of write. I've added 1 :-) test into the gist. It is a "shared" thing, and it deserializes properly. ad side note) You mean like oos.beginWriteObject(obj): Boolean (false if already there and ID written, true if not and ID generated and definition starts) and oos.endWriteObject()? Regards... |
@axel22 said: Otherwise, looks good to me! |
Jan Vanek (j3vanek) said: Thanks! I'll try to do that during the weekend. Of course if there is pressure, please do that now. |
Jan Vanek (j3vanek) said: |
@axel22 said: |
Jan Vanek (j3vanek) said: |
@retronym said: |
Jan Vanek (j3vanek) said: |
@axel22 said: |
@axel22 said: |
@axel22 said: a part of the test just got commented out. That part of the test was supposed to check if legacy list deserialization works.
As I was unsure what are binary compatibility requirements exactly say on serialization, I've implemented legacy deserialization for lists deserialized from 2.9.x. I've just implemented a fallback for legacy deserialization for lists serialized with 2.10.0. |
Jan Vanek (j3vanek) said: |
@axel22 said: In any case, would be grateful for any pointers into what our binary compatibility policy has to say. |
@axel22 said: |
Jan Vanek (j3vanek) said: |
@axel22 said: |
@axel22 said: Changing the list size above object Test {
def deepCopy[T](obj : T, reportSize: Boolean = false): T = {
val baos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(baos)
oos.writeObject(obj)
oos.flush()
val data = baos.toByteArray
if (reportSize) println(data.length)
val bais = new ByteArrayInputStream(data)
val ois = new ObjectInputStream(bais)
ois.readObject().asInstanceOf[T]
}
def test() {
case class Foo(tail: List[Foo])
def create(len: Int): List[Foo] = if (len == 0) Nil else {
val tail = create(len - 1)
Foo(tail) :: tail
}
val xs = List.fill(32)(7)
val dxs = deepCopy(xs, true)
assert(xs == dxs)
val ys = create(32)
val dys = deepCopy(ys, true)
assert(ys == dys)
}
def main(args: Array[String]) {
test()
}
} |
Jan Vanek (j3vanek) said (edited on Jan 29, 2013 2:49:29 PM UTC): There is one thing, during the deserialization, e.g. in your example the Foo (which is a head) will be deserialized, pointing to the list (Foo.tail) which exists, but the heads in there were not yet deserialized. So, if there is a custom serialization code in Foo (readObject), which scans over the list, then we have a problem :-(. So we should store and read the heads of those in-line tails actually in reverse order, which is not really possible, unless we allocate whole damn list. But even if we do that in reverse order, it is not a guarantee, one could create a list whose sub-sub-..-list's head points to that list. And that head's readObject would see the list in not-fully-deserialized state. As soon as we split into 2 loops the deserialization (readObject) can observe "weird" state. So I think we need to choose how to die. I'd vote for this new death, we'd have to tell users to not scan/iterate over lists in their readObject methods, because those lists might not be fully deserialized yet. |
@axel22 said: The list buffers which suffered from the loss of structural sharing were fixed anyway. People who use lists and structural sharing will have to implement custom serialization - the consensus was that a common use case is just having a very big list which should be serialized quickly and using the least space as possible. Additionally, this will simplify list serialization code greatly. |
Jan Vanek (j3vanek) said: |
Jan Vanek (j3vanek) said: |
@retronym said: |
@axel22 said: |
@JamesIry said: |
Compiling and running this program will result in the issue that it overflows the stack at runtime in 2.10 but not in 2.9. Please fix this - it's a critical bug.
Here is the stack trace:
The text was updated successfully, but these errors were encountered: