RegexParsers.scala has O(inputlength) memory performance on java >= 7u6 #7710

scabug · 2013-07-31T10:23:41Z

From 1.7.0_06 onwards, String.substring() (and .subSequence) was changed to no longer re-use the internal char[] data, but make a copy instead. Since RegexParsers.scala:109 calls subSequence() for every character parsed, it now effectively re-allocates the whole remaining parse content for every parse step.

This shows in horrible parse performance and GC for parsing a 3MB file using https://github.com/ngocdaothanh/scaposer , which would parse almost instantly in Java 6.

Details on the changes to java.lang.String are mentioned here:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6924259
http://java-performance.info/changes-to-string-java-1-7-0_06/
http://grokbase.com/t/gg/scala-user/132v5z1678/performance-of-javatokenparsers-with-java7

I guess one way around it would be wrapping CharSequence in a simple buffer, that does re-use the underlying CharSequence, adding in skip/count fields that maintain the current position.

scabug · 2013-07-31T10:23:41Z

Imported From: https://issues.scala-lang.org/browse/SI-7710?orig=1
Reporter: Jan Ypma (jypma)
Affected Versions: 2.10.1, 2.10.2, 2.11.0

scabug · 2013-12-28T21:51:03Z

Zach Moazeni (zmoazeni) said:
I took Jan's suggestion and put together a FastCharSequence and it reclaimed the performance for java >= 7u6 when wrapping a String. I don't know if this belongs in the standard library, but I'm posting it here in case others find it useful.

// FastCharSequence.scala
import java.lang.CharSequence

class FastCharSequence(chars: Array[Char], val startBounds: Int, val endBounds: Int) extends CharSequence {
  def this(chars: Array[Char]) = this(chars, 0, chars.length)
  def this(input: String)      = this(input.toCharArray)

  def length(): Int = endBounds - startBounds

  def charAt(index: Int): Char = {
    if (index < length) {
      chars(index + startBounds)
    } else {
      throw new IndexOutOfBoundsException(s"$boundsInfo index: $index")
    }
  }

  def subSequence(start: Int, end: Int): CharSequence = {
    if (start >= 0 && start <= length && end >= 0 && end <= length) {
      new FastCharSequence(chars, startBounds + start, startBounds + end)
    } else {
     throw new IndexOutOfBoundsException(s"$boundsInfo start: $start, end $end")
    }
  }

  override def toString(): String = new String(chars, startBounds, length)

  private def boundsInfo = s"current: (startBounds: $startBounds, endBounds: $endBounds, length: $length, chars length: ${chars.length})"
}

scabug · 2013-12-28T22:24:47Z

@Ichoran said:
@zmoazeni Great! Do you have the parser changes available also? If you submit a pull request, I'll check it and try to get it included in 2.11.

scabug · 2013-12-28T22:32:56Z

Zach Moazeni (zmoazeni) said:
@rex Oh, I didn't make any changes to the parser code at all. I'm using the stock code from 2.10.3.

All I'm doing is wrapping a String in this class and parsing that instead of parsing the String directly. This conforms to the CharSequence interface (which the scala parsing code depends on). In java < 7u6, the underlying character array was reused within String.java. Now String.java copies the underlying array each time for safety.

So this class just reuses that array at level higher and any others created from subSequence.

I considered memoizing toString since new String(chars, startBounds, length), will iterate over the array each time, but perf seemed good enough without it.

scabug · 2013-12-28T22:40:19Z

@Ichoran said:
@zmoazeni - Ah, I get it. It's been a while since I looked at the parser code. I'll look into putting this into 2.11 at least.

scabug · 2014-04-14T22:02:24Z

Tony Sloane (asloane) said:
Did this make it any further? Will it be in a release soon?

scabug · 2014-04-14T23:19:31Z

@Ichoran said:
[~asloane] - No, I didn't get around to adding it to 2.11, unfortunately.

scabug · 2014-04-15T00:09:53Z

Tony Sloane (asloane) said:
Pity. I guess it is easier now that the parser combinators are split out into a separate library (although that doesn't help 2.10 at all, I guess). Anything I can do to help?

scabug · 2014-04-17T07:37:40Z

Bruno Woltzenlogel Paleo (bruno.wp) said:
I have observed this problem too (see:
http://stackoverflow.com/questions/23117635/why-is-scalas-combinator-parsing-slow-when-parsing-large-files-what-can-i-do/23119600?noredirect=1#23119600). I am looking forward to something better in Scala 2.11.

scabug · 2014-04-28T18:05:53Z

@gourlaysama said:
PR: scala/scala-parser-combinators#17
See #8542 for the benchmark used.

scabug · 2014-06-25T09:41:16Z

@adriaanm said:
scala/scala-parser-combinators#17

scabug · 2014-07-01T13:16:02Z

Stéphane Landelle (slandelle) said:
Will this fix be backported to 2.10?

scabug · 2014-07-02T14:12:49Z

@gourlaysama said:
Sure.
Backport for 2.10.5: scala/scala#3860

scabug closed this as completed Jun 25, 2014

scabug added blocker has PR performance labels Apr 7, 2017

scabug added this to the 2.11.2 milestone Apr 7, 2017

scabug assigned gourlaysama Apr 7, 2017

This was referenced Apr 7, 2017

Combination of "rep" and regex parsers is slow for large files in combinator parsing #8509

Closed

Parser Combinators - parsing String is worse than StringReader #8542

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RegexParsers.scala has O(inputlength) memory performance on java >= 7u6 #7710

RegexParsers.scala has O(inputlength) memory performance on java >= 7u6 #7710

scabug commented Jul 31, 2013

scabug commented Jul 31, 2013

scabug commented Dec 28, 2013 •

edited by retronym

scabug commented Dec 28, 2013

scabug commented Dec 28, 2013

scabug commented Dec 28, 2013

scabug commented Apr 14, 2014

scabug commented Apr 14, 2014

scabug commented Apr 15, 2014

scabug commented Apr 17, 2014

scabug commented Apr 28, 2014

scabug commented Jun 25, 2014

scabug commented Jul 1, 2014

scabug commented Jul 2, 2014

RegexParsers.scala has O(inputlength) memory performance on java >= 7u6 #7710

RegexParsers.scala has O(inputlength) memory performance on java >= 7u6 #7710

Comments

scabug commented Jul 31, 2013

scabug commented Jul 31, 2013

scabug commented Dec 28, 2013 • edited by retronym

scabug commented Dec 28, 2013

scabug commented Dec 28, 2013

scabug commented Dec 28, 2013

scabug commented Apr 14, 2014

scabug commented Apr 14, 2014

scabug commented Apr 15, 2014

scabug commented Apr 17, 2014

scabug commented Apr 28, 2014

scabug commented Jun 25, 2014

scabug commented Jul 1, 2014

scabug commented Jul 2, 2014

scabug commented Dec 28, 2013 •

edited by retronym