Combination of "rep" and regex parsers is slow for large files in combinator parsing #8509

Closed
scabug opened this issue Apr 16, 2014 · 3 comments


scabug commented Apr 16, 2014

I need to parse files that have millions of lines. I noticed that my combinator parser gets slower and slower as it parses more lines. The problem seems to be in Scala's "rep" or regex parsers, because this behaviour occurs even for the simple example parser shown below:

def file: Parser[Int] = rep(line) ^^ { _ => 1 } // a file is a repetition of lines

def line: Parser[Int] = """(?m)^.*$""".r ^^ { _ => 0 } // reads a line and returns 0

When I try to parse a file with millions of lines of equal length with this simple parser, in the beginning it parses 46 lines/ms. After 370000 lines, the speed drops to 20 lines/ms. After 840000 lines, it drops to 10 lines/ms. After 1790000 lines, 5 lines/ms...
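
For anyone who wants to collect comparable figures, here is a minimal sketch of a per-chunk throughput meter. It is not the code used for the numbers above: `ThroughputMeter`, `chunkSize` and the injectable `clock` are hypothetical names chosen for illustration, and it uses only the standard library.

```scala
// Hypothetical helper, not part of the original report.
// Records the parsing rate (lines per millisecond) once per chunk of lines.
final class ThroughputMeter(chunkSize: Int,
                            clock: () => Long = () => System.nanoTime()) {
  private var count = 0L
  private var chunkStart = clock()
  private val rates = scala.collection.mutable.ArrayBuffer.empty[Double]

  /** Call once per parsed line. */
  def tick(): Unit = {
    count += 1
    if (count % chunkSize == 0) {
      val now = clock()
      val elapsedMs = (now - chunkStart) / 1e6 // nanoseconds to milliseconds
      rates += (if (elapsedMs > 0) chunkSize / elapsedMs
                else Double.PositiveInfinity)
      chunkStart = now
    }
  }

  /** One entry per completed chunk, oldest first. */
  def linesPerMs: List[Double] = rates.toList
}
```

Calling `tick()` from the semantic action of the `line` parser and printing `linesPerMs` afterwards would show whether the rate falls as the parse advances.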

I know that, in principle, combinator parsers were not made for parsing inputs that are so large. But wouldn't it be possible to improve "rep" and the regex parsers so that this performance degradation does not occur?
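
One possible workaround, which is not taken from this thread and only applies when the grammar is strictly line-oriented, is to skip `rep` over the whole input and run the `line` parser on each line separately. The sketch below assumes the scala-parser-combinators library is on the classpath; `LineByLine` and `parseLines` are hypothetical names.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical workaround sketch: one parseAll call per input line.
object LineByLine extends RegexParsers {

  // Same shape as the line parser in the report: reads a line, returns 0.
  def line: Parser[Int] = """(?m)^.*$""".r ^^ { _ => 0 }

  /** Parses each line on its own.
    * Returns the number of lines parsed, or a message for the first failure.
    */
  def parseLines(lines: Iterator[String]): Either[String, Int] = {
    var n = 0
    while (lines.hasNext) {
      parseAll(line, lines.next()) match {
        case Success(_, _) => n += 1
        case ns: NoSuccess => return Left(s"line ${n + 1}: ${ns.msg}")
      }
    }
    Right(n)
  }
}
```

With `scala.io.Source.fromFile(path).getLines()` as the iterator, every parser call sees a short, independent string, so its cost depends only on the length of that line and not on how much of the file has already been read. The trade-off is that constructs spanning several lines can no longer be expressed in the grammar itself.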

I asked a related question on Stack Overflow: http://stackoverflow.com/questions/23117635/why-is-scalas-combinator-parsing-slow-when-parsing-large-files-what-can-i-do

Thanks!


scabug commented Apr 16, 2014

Imported From: https://issues.scala-lang.org/browse/SI-8509?orig=1
Reporter: Bruno Woltzenlogel Paleo (bruno.wp)
Duplicates #7710


scabug commented Apr 17, 2014

Bruno Woltzenlogel Paleo (bruno.wp) said:
This seems related to this other issue: #7710

There may be more to this issue, though. I did some profiling, and neither memory consumption nor GC time is the problem for me.


scabug commented Jun 29, 2014

@gourlaysama said:
After discussion in #7710 and the corresponding PR, I am closing this one as a duplicate.
