Scala Programming Language / SI-6409

Stream flatMap leaks memory if mapper returns chain of empty results

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: Scala 2.11.1-RC1
    • Component/s: Collections
    • Labels: None

      Description

      The reference to the original stream head is not released until the mapper function produces a non-empty stream. This can be a problem if the elements consume considerable memory and the mapper returns a long chain of empty streams.

      scala> def run[A](s: Stream[A]) = s.flatMap(_ => Stream.empty[Int])
      run: [A](s: Stream[A])scala.collection.immutable.Stream[Int]
      
      scala> def bigElemts = Stream.range(1, 1000000).map(x => List.fill(1000)("big"))
      bigElemts: scala.collection.immutable.Stream[List[java.lang.String]]
      
      scala> run(bigElemts)
      java.lang.OutOfMemoryError: Java heap space
      

      With Stream's head not being lazy, I'm not sure this can be solved. But it would at least be nice to include a hint in the flatMap apidocs.
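
      For illustration (an editorial sketch, not part of the original report): Stream's head is strict and only the tail is a deferred thunk, so any cons cell that flatMap still references keeps its already evaluated head element alive.

      // Minimal sketch (not from the ticket): Stream.cons evaluates its head argument
      // eagerly; only the tail is deferred. Any reachable cons cell therefore pins its
      // fully evaluated head in memory.
      val s = Stream.cons({ println("head evaluated"); List.fill(1000)("big") }, Stream.empty)
      // prints "head evaluated" immediately, before anything is consumed from s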

        Activity

        Robin Palotai added a comment -

        Note also, in case you are concerned that it is the above run method that holds on to the stream head:

        scala> @scala.annotation.tailrec def len[A](s: Stream[A], acc: Int = 0): Int = if (s.isEmpty) acc else len(s.tail, acc + 1)
        len: [A](s: Stream[A], acc: Int)Int
        
        scala> len(bigElemts)
        res3: Int = 999999
        
        scala> len(bigElemts.flatMap(_ => Stream.empty[Int]))
           ... OOM
        

        So it is flatMap that holds on to the head.

        Kota Mizushima added a comment - edited

        Your example code doesn't show that the above run method doesn't hold on to the stream head, because len is tail recursive: a tail-recursive method jumps back to the beginning of the method and rewrites the corresponding parameter s on each self-call.

        However, my previous comment was also wrong. The cause of this problem is not that the run method holds on to the stream head.
        The cause is that flatMap keeps evaluating each thunk of the Stream until a non-empty sequence is produced.

        In len(bigElemts), no memory leak occurs because s is rewritten to its tail on each call to len(s.tail, acc + 1).
        In len(bigElemts.flatMap(_ => Stream.empty[Int])), all 1,000,000 thunks of bigElemts (each a List.fill(1000)("big")) are evaluated into large lists, consuming a lot of memory. The head of bigElemts cannot be released inside flatMap because bigElemts itself is the this reference, which cannot be rewritten.
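
        To make the eager walk visible (a small editorial sketch, not from the ticket): mapping with a side effect shows that flatMap keeps forcing source elements as long as the mapper returns empty streams.

        // Editorial sketch: flatMap eagerly walks the source Stream while the mapper
        // keeps returning empty results. The println makes each forced element visible.
        val src = Stream.range(1, 6).map { i => println("evaluating element " + i); i }
        // "evaluating element 1" prints right away (Stream's head is strict)
        src.flatMap(_ => Stream.empty[Int])
        // elements 2 through 5 print here: flatMap forces the entire source looking
        // for a non-empty result, even though it ultimately returns an empty Stream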

        Robin Palotai added a comment -

        Kota: exactly, that is my problem. The flatMap API doc mentions that, just like map, flatMap is lazy, so one would naively assume (without a deeper understanding of Scala's Stream and its flatMap) that flatMap behaves well memory-wise too.

        However, when flatMap's mapper yields a long chain of empty Streams, this does not hold, contrary to the (naive and wrong) expectation. An API doc entry would save many hours of debugging (especially in a setup where it is not obvious which Stream transform step causes the leak).

        By the way, yielding a long run of empty streams is not artificial (I know you were not claiming this; I'm just expressing my opinion); it happened in a live dataset.

        Josh Suereth added a comment -

        https://github.com/scala/scala/pull/1397

        Not a fix, but documentation for now. A workaround may be more extensive; will look into it for 2.10.1.

        James Iry added a comment -

        2.10.2 is about to be cut. Kicking down the road and un-assigning to foster work stealing.

        Adriaan Moors added a comment -

        I seem to remember a related ticket/PR, but couldn't find it.

        Adriaan Moors added a comment -

        Since 2.11.0-RC1 is one week away, pushing all non-blockers without PR to 2.11.1-RC1. Please undo the change if I missed work in progress.


          People

          • Assignee: Rex Kerr
          • Reporter: Robin Palotai
          • Votes: 0
          • Watchers: 8
