Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RegexParsers.scala has O(inputlength) memory performance on java >= 7u6 #7710

Closed
scabug opened this issue Jul 31, 2013 · 13 comments
Closed

RegexParsers.scala has O(inputlength) memory performance on java >= 7u6 #7710

scabug opened this issue Jul 31, 2013 · 13 comments

Comments

@scabug
Copy link

scabug commented Jul 31, 2013

From 1.7.0_06 onwards, String.substring() (and .subSequence) was changed to no longer re-use the internal char[] data, but make a copy instead. Since RegexParsers.scala:109 calls subSequence() for every character parsed, it now effectively re-allocates the whole remaining parse content for every parse step.

This shows in horrible parse performance and GC for parsing a 3MB file using https://github.com/ngocdaothanh/scaposer , which would parse almost instantly in Java 6.

Details on the changes to java.lang.String are mentioned here:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6924259
http://java-performance.info/changes-to-string-java-1-7-0_06/
http://grokbase.com/t/gg/scala-user/132v5z1678/performance-of-javatokenparsers-with-java7

I guess one way around it would be wrapping CharSequence in a simple buffer, that does re-use the underlying CharSequence, adding in skip/count fields that maintain the current position.

@scabug
Copy link
Author

scabug commented Jul 31, 2013

Imported From: https://issues.scala-lang.org/browse/SI-7710?orig=1
Reporter: Jan Ypma (jypma)
Affected Versions: 2.10.1, 2.10.2, 2.11.0

@scabug
Copy link
Author

scabug commented Dec 28, 2013

Zach Moazeni (zmoazeni) said:
I took Jan's suggestion and put together a FastCharSequence and it reclaimed the performance for java >= 7u6 when wrapping a String. I don't know if this belongs in the standard library, but I'm posting it here in case others find it useful.

// FastCharSequence.scala
import java.lang.CharSequence

class FastCharSequence(chars: Array[Char], val startBounds: Int, val endBounds: Int) extends CharSequence {
  def this(chars: Array[Char]) = this(chars, 0, chars.length)
  def this(input: String)      = this(input.toCharArray)

  def length(): Int = endBounds - startBounds

  def charAt(index: Int): Char = {
    if (index < length) {
      chars(index + startBounds)
    } else {
      throw new IndexOutOfBoundsException(s"$boundsInfo index: $index")
    }
  }

  def subSequence(start: Int, end: Int): CharSequence = {
    if (start >= 0 && start <= length && end >= 0 && end <= length) {
      new FastCharSequence(chars, startBounds + start, startBounds + end)
    } else {
     throw new IndexOutOfBoundsException(s"$boundsInfo start: $start, end $end")
    }
  }

  override def toString(): String = new String(chars, startBounds, length)

  private def boundsInfo = s"current: (startBounds: $startBounds, endBounds: $endBounds, length: $length, chars length: ${chars.length})"
}

@scabug
Copy link
Author

scabug commented Dec 28, 2013

@Ichoran said:
@zmoazeni Great! Do you have the parser changes available also? If you submit a pull request, I'll check it and try to get it included in 2.11.

@scabug
Copy link
Author

scabug commented Dec 28, 2013

Zach Moazeni (zmoazeni) said:
@rex Oh, I didn't make any changes to the parser code at all. I'm using the stock code from 2.10.3.

All I'm doing is wrapping a String in this class and parsing that instead of parsing the String directly. This conforms to the CharSequence interface (which the scala parsing code depends on). In java < 7u6, the underlying character array was reused within String.java. Now String.java copies the underlying array each time for safety.

So this class just reuses that array at level higher and any others created from subSequence.

I considered memoizing toString since new String(chars, startBounds, length), will iterate over the array each time, but perf seemed good enough without it.

@scabug
Copy link
Author

scabug commented Dec 28, 2013

@Ichoran said:
@zmoazeni - Ah, I get it. It's been a while since I looked at the parser code. I'll look into putting this into 2.11 at least.

@scabug
Copy link
Author

scabug commented Apr 14, 2014

Tony Sloane (asloane) said:
Did this make it any further? Will it be in a release soon?

@scabug
Copy link
Author

scabug commented Apr 14, 2014

@Ichoran said:
[~asloane] - No, I didn't get around to adding it to 2.11, unfortunately.

@scabug
Copy link
Author

scabug commented Apr 15, 2014

Tony Sloane (asloane) said:
Pity. I guess it is easier now that the parser combinators are split out into a separate library (although that doesn't help 2.10 at all, I guess). Anything I can do to help?

@scabug
Copy link
Author

scabug commented Apr 17, 2014

Bruno Woltzenlogel Paleo (bruno.wp) said:
I have observed this problem too (see:
http://stackoverflow.com/questions/23117635/why-is-scalas-combinator-parsing-slow-when-parsing-large-files-what-can-i-do/23119600?noredirect=1#23119600). I am looking forward to something better in Scala 2.11.

@scabug
Copy link
Author

scabug commented Apr 28, 2014

@gourlaysama said:
PR: scala/scala-parser-combinators#17
See #8542 for the benchmark used.

@scabug
Copy link
Author

scabug commented Jun 25, 2014

@scabug scabug closed this as completed Jun 25, 2014
@scabug
Copy link
Author

scabug commented Jul 1, 2014

Stéphane Landelle (slandelle) said:
Will this fix be backported to 2.10?

@scabug
Copy link
Author

scabug commented Jul 2, 2014

@gourlaysama said:
Sure.
Backport for 2.10.5: scala/scala#3860

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants