indexOfSlice() hangs when working on a largish stream #9830

Closed
scabug opened this issue Jun 25, 2016 · 5 comments

@scabug

scabug commented Jun 25, 2016

This takes a few seconds but works:

val source = scala.io.Source.fromChars(("x" * 6000000).toArray)
source.toSeq.indexOfSlice("tteesstt")

Change the 6000000 to 7000000 and it hangs, eating CPU (though not memory).

It seems to be the indexOfSlice call that is failing.

@scabug

scabug commented Jun 25, 2016

Imported From: https://issues.scala-lang.org/browse/SI-9830?orig=1
Reporter: ImNotTellingYouThat (intyt)
Affected Versions: 2.11.8

@scabug

scabug commented Jun 28, 2016

Jasper-M said:
Are you sure that the JVM's memory isn't full and the CPU usage you're seeing isn't just the garbage collector?

@scabug

scabug commented Jun 28, 2016

ImNotTellingYouThat (intyt) said:
Jasper Moeys: I thought it unlikely as there was no obvious memory use but I've just repeated it and bingo,

java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.StreamIterator.<init>(Stream.scala:1104)
at scala.collection.immutable.Stream.iterator(Stream.scala:578)
at scala.collection.SeqLike$class.startsWith(SeqLike.scala:304)
...

So

  1. It locked up before, now it crashes; is this a flaw in the JVM or in Scala? Bear in mind that when I reported this it was just hanging, to the point that I had to kill the terminal window, so each time I repeated the test (and it hung each time) it was starting a new JVM instance.

  2. Why, in Windows' Task Manager, am I not seeing any significant memory use? I have plenty to spare.

  3. Why should it run out of memory? Bear in mind I know very little about Scala, but with this code
    source.toSeq.indexOfSlice("tteesstt")
    toSeq produces a lazy structure:

scala> source.toSeq
res0: Seq[Char] = Stream(x, ?)

so the obvious question is, is indexOfSlice hanging on to the head of the stream as it works its way along, because how else is memory being retained? What should be happening do you think? (I'm asking because I genuinely don't know).

cheers

jan
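
On the head-retention question above, one way to sidestep it is to search the iterator directly rather than going through toSeq. The following is a minimal sketch, assuming the only goal is to find the position of a slice. `IteratorSearchSketch` and `indexOfSliceIn` are hypothetical names, not part of the standard library; `sliding` and `indexWhere` are standard `Iterator` methods. Each window is dropped once it has been tested, so this code keeps no reference to earlier elements. It compares every window naively and is not tuned for speed.

```scala
object IteratorSearchSketch {
  // Hypothetical helper: index of the first occurrence of `needle` in `it`, or -1.
  def indexOfSliceIn(it: Iterator[Char], needle: String): Int = {
    if (needle.isEmpty) 0
    else {
      val target = needle.toList
      // sliding yields consecutive windows of needle.length elements (step 1);
      // a short final window simply fails the comparison.
      it.sliding(needle.length).indexWhere(w => w.sameElements(target))
    }
  }
}
```

Since `scala.io.Source` is an `Iterator[Char]`, the `source` value from the original report could be passed in as the first argument.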

@scabug

scabug commented Jul 3, 2016

ImNotTellingYouThat (intyt) said:
Further, I've just repeated this to look at memory use, and this time it hung (and has been hung for several minutes at 100% CPU) rather than throwing an exception.
When I started scala in the JVM, the JVM was at ~240 meg. With it hanging it's at 360 meg. This trivial amount of memory is well within what a 32-bit JVM should be able to handle, never mind a 64-bit one. I've actually got ~10 gig of memory free.

@scabug scabug closed this as completed Aug 10, 2016
@scabug

scabug commented Aug 10, 2016

@SethTisue said:
Thank you for the report!

This doesn't have anything to do with indexOfSlice in particular; substituting e.g. last shows the same behavior.

The underlying issue here is that at runtime, calling .toSeq on Iterator returns a scala.collection.immutable.Stream, which is a very expensive data structure (both in space and time) — though the fact that its tail is lazy means that you don't pay the cost until you actually traverse it.

In general, toSeq is something of a trap in that as a type, Seq gives you almost no guarantees. The Seq you get back might be strict or lazy, compact or memory hungry, finite or infinite, stack overflow prone or stack safe, etc etc etc. For small collections it often doesn't matter but as soon as you're slinging big amounts of data around you probably want to be working with concrete collection types so you know what you're getting. Substitute toVector for toSeq here, and it will run pretty fast.
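
The substitution suggested above could look like the following. This is a sketch only: `ToVectorSketch` and `indexOfNeedle` are hypothetical names, and the input size is left as a parameter so the example can be tried with a small value before using the 7000000 from the report.

```scala
object ToVectorSketch {
  // Hypothetical helper: build a Source of n 'x' characters and search it.
  def indexOfNeedle(n: Int, needle: String): Int = {
    val source = scala.io.Source.fromChars(("x" * n).toArray)
    // toVector copies the elements into a strict Vector up front,
    // so the search runs over a concrete, known collection type.
    source.toVector.indexOfSlice(needle)
  }
}
```

With an input made only of 'x' characters, the needle "tteesstt" never occurs, so the call returns -1.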

So, I've responded here on JIRA, but not to all of your questions. They are good questions, but I suggest asking them on scala-user, the scala/scala Gitter channel, or Stack Overflow. (If you have followup questions about what I've said here, same recommendation about where to take the discussion.)
