Combination of "rep" and regex parsers is slow for large files in combinator parsing #8509

scabug · 2014-04-16T19:39:39Z

I need to parse files that have millions of lines. I noticed that my combinator parser gets slower and slower as it parses more lines. The problem seems to be in Scala's "rep" or regex parsers, because this behaviour occurs even for the simple example parser shown below:

def file: Parser[Int] = rep(line) ^^ { _ => 1 } // a file is a repetition of lines

def line: Parser[Int] = """(?m)^.*$""".r ^^ { _ => 0 } // reads a line and returns 0

When I parse a file with millions of lines of equal length using this simple parser, it initially parses 46 lines/ms. After 370000 lines the speed drops to 20 lines/ms, after 840000 lines to 10 lines/ms, and after 1790000 lines to 5 lines/ms...
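To make the measurement above easier to reproduce, here is a minimal sketch of a timing harness. It is not the reporter's code: the object name `RepRegexTiming`, the window size, and the synthetic input are assumptions made for illustration, and it presumes the scala-parser-combinators module is on the classpath. It also swaps the line pattern for `.+`, so that the line parser always consumes at least one character and `rep` stops at the end of the input. Whether the slowdown shows up is likely to depend on the JDK and library versions in use.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical timing harness (not from the report).
// Assumes scala-parser-combinators is on the classpath.
object RepRegexTiming extends RegexParsers {
  private val window = 100000   // report throughput every `window` lines
  private var parsed = 0        // lines parsed so far
  private var windowStart = 0L  // nanoTime at the start of the current window

  // `.+` must consume at least one character, so it fails at end of input.
  // Newlines between lines are skipped by RegexParsers' default whitespace handling.
  def line: Parser[Int] = """.+""".r ^^ { _ =>
    parsed += 1
    if (parsed % window == 0) {
      val now = System.nanoTime()
      val ms = math.max(1L, (now - windowStart) / 1000000L)
      println(s"$parsed lines parsed, ${window / ms} lines/ms in the last window")
      windowStart = now
    }
    0
  }

  // A file is a repetition of lines; the result is the number of lines read.
  def file: Parser[Int] = rep(line) ^^ { lines => lines.size }

  /** Parses `n` synthetic lines of equal length and returns how many were read. */
  def run(n: Int): Int = {
    // The whole input is built in memory, with no trailing newline.
    val input = Iterator.fill(n)("x" * 40).mkString("\n")
    parsed = 0
    windowStart = System.nanoTime()
    parseAll(file, input) match {
      case Success(count, _) => count
      case other             => sys.error(other.toString)
    }
  }

  def main(args: Array[String]): Unit = {
    val n = args.headOption.map(_.toInt).getOrElse(1000000)
    println(s"total lines: ${run(n)}")
  }
}
```

Running it with a line count as the only argument prints one throughput figure per window, which can be compared against the numbers quoted above.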

I know that, in principle, combinator parsers were not made for parsing inputs that are so large. But wouldn't it be possible to improve "rep" and the regex parsers so that this performance degradation does not occur?

I asked a related question on Stack Overflow: http://stackoverflow.com/questions/23117635/why-is-scalas-combinator-parsing-slow-when-parsing-large-files-what-can-i-do

Thanks!

scabug · 2014-04-16T19:39:39Z

Imported From: https://issues.scala-lang.org/browse/SI-8509?orig=1
Reporter: Bruno Woltzenlogel Paleo (bruno.wp)
Duplicates #7710

scabug · 2014-04-17T07:24:51Z

Bruno Woltzenlogel Paleo (bruno.wp) said:
This seems related to this other issue: #7710

There may be something else to this issue, though. I did some profiling, and memory consumption and GC time are not the problem for me.

scabug · 2014-06-29T15:12:12Z

@gourlaysama said:
After discussion in #7710 and the corresponding PR, I am closing this one as a duplicate.

scabug closed this as completed Jun 29, 2014

scabug added enhancement duplicate labels Apr 7, 2017
