RegexParsers.scala has O(inputlength) memory performance on java >= 7u6 #7710
Comments
Imported From: https://issues.scala-lang.org/browse/SI-7710?orig=1
Zach Moazeni (zmoazeni) said:
// FastCharSequence.scala
import java.lang.CharSequence

// A CharSequence view over a shared char array; subSequence returns a new view instead of copying.
class FastCharSequence(chars: Array[Char], val startBounds: Int, val endBounds: Int) extends CharSequence {
  def this(chars: Array[Char]) = this(chars, 0, chars.length)
  def this(input: String) = this(input.toCharArray)

  def length(): Int = endBounds - startBounds

  def charAt(index: Int): Char = {
    // Reject negative indices too, so a view can never read before its own start.
    if (index >= 0 && index < length) {
      chars(index + startBounds)
    } else {
      throw new IndexOutOfBoundsException(s"$boundsInfo index: $index")
    }
  }

  def subSequence(start: Int, end: Int): CharSequence = {
    // Requiring start <= end keeps the new view's length non-negative.
    if (start >= 0 && start <= end && end <= length) {
      new FastCharSequence(chars, startBounds + start, startBounds + end)
    } else {
      throw new IndexOutOfBoundsException(s"$boundsInfo start: $start, end $end")
    }
  }

  // Only toString copies characters out of the shared array.
  override def toString(): String = new String(chars, startBounds, length)

  private def boundsInfo = s"current: (startBounds: $startBounds, endBounds: $endBounds, length: $length, chars length: ${chars.length})"
}
Zach Moazeni (zmoazeni) said: All I'm doing is wrapping a String in this class and parsing that instead of parsing the String directly. The class conforms to the CharSequence interface (which the Scala parsing code depends on). In Java < 7u6, the underlying character array was reused within String.java. Now String.java copies the underlying array each time for safety. So this class just reuses that array one level higher and any others created from I considered memoizing
Tony Sloane (asloane) said: |
@Ichoran said: |
Tony Sloane (asloane) said: |
Bruno Woltzenlogel Paleo (bruno.wp) said: |
@gourlaysama said: |
Stéphane Landelle (slandelle) said: |
@gourlaysama said: |
From 1.7.0_06 onwards, String.substring() (and .subSequence) was changed to no longer re-use the internal char[] data, but make a copy instead. Since RegexParsers.scala:109 calls subSequence() for every character parsed, it now effectively re-allocates the whole remaining parse content for every parse step.
This shows up as very poor parse performance and heavy GC activity when parsing a 3MB file with https://github.com/ngocdaothanh/scaposer , which parsed almost instantly on Java 6.
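To get a feel for the cost, here is a rough model rather than the actual RegexParsers code: a loop that, as described above, takes one subSequence of the remaining input per character. `CopyCost` and `charsCopied` are made-up names for illustration. The function sums the lengths of the sub-sequences it requests; on a JVM where String.subSequence copies, that sum is the number of characters copied.

```scala
// Hypothetical model of "one subSequence of the remaining input per parse step".
object CopyCost {
  // Sums the lengths of all sub-sequences requested while stepping through the input.
  def charsCopied(input: String): Long = {
    var copied = 0L
    var offset = 0
    while (offset < input.length) {
      // On Java >= 7u6 this call allocates a fresh copy of the remaining input.
      val rest: CharSequence = input.subSequence(offset, input.length)
      copied += rest.length()
      offset += 1
    }
    copied
  }
}
```

For an input of n characters the total is n(n+1)/2, so the copying grows quadratically with input length under this model.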
Details on the changes to java.lang.String are mentioned here:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6924259
http://java-performance.info/changes-to-string-java-1-7-0_06/
http://grokbase.com/t/gg/scala-user/132v5z1678/performance-of-javatokenparsers-with-java7
I guess one way around it would be to wrap the CharSequence in a simple buffer that re-uses the underlying CharSequence and adds skip/count fields to track the current position.
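A minimal sketch of that idea, assuming a wrapper that holds any CharSequence plus skip/count fields. The name `CharSequenceView` and its companion `apply` are hypothetical and not part of the Scala library; this is an illustration, not the fix that was adopted.

```scala
// Hypothetical zero-copy window over any CharSequence, tracked by skip/count.
final class CharSequenceView(underlying: CharSequence, skip: Int, count: Int) extends CharSequence {
  require(skip >= 0 && count >= 0 && skip + count <= underlying.length(),
    s"invalid window: skip=$skip, count=$count, underlying length=${underlying.length()}")

  def length(): Int = count

  def charAt(index: Int): Char = {
    if (index < 0 || index >= count)
      throw new IndexOutOfBoundsException(s"index: $index, length: $count")
    underlying.charAt(skip + index)
  }

  // Returns another view over the same underlying sequence; no characters are copied.
  def subSequence(start: Int, end: Int): CharSequence = {
    if (start < 0 || start > end || end > count)
      throw new IndexOutOfBoundsException(s"start: $start, end: $end, length: $count")
    new CharSequenceView(underlying, skip + start, end - start)
  }

  // The only place that materializes the window as a String.
  override def toString: String = underlying.subSequence(skip, skip + count).toString
}

object CharSequenceView {
  def apply(s: CharSequence): CharSequenceView = new CharSequenceView(s, 0, s.length())
}
```

Wrapping the input once, for example `CharSequenceView(text)`, and handing the wrapper to code that accepts a CharSequence means every later subSequence call only adjusts the two offsets.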