Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex.unapplySeq should use m.find(0) instead of m.matches #5045

Closed
scabug opened this issue Sep 29, 2011 · 24 comments
Closed

Regex.unapplySeq should use m.find(0) instead of m.matches #5045

scabug opened this issue Sep 29, 2011 · 24 comments
Assignees
Milestone

Comments

@scabug
Copy link

scabug commented Sep 29, 2011

In Regex the unapplySeq method uses m.matches to determine if the regex should extract matching groups. This should instead use m.find(0).

Using m.matches acts as if the pattern were implicitly anchored. Consider

val Re = """(x)""".r
val Re(x) = "xyz"
val Re(y) = "x"

When using a regular expression a user can state explicitly that they want anchoring and the Regex library should not mandate that decision. As an aside this often leads to people surrounding their patterns with ".PTRN." which can turn a nice fast pattern into one with heavy backtracking. See Mastering Regular Expressions by Friedl for details.

@scabug
Copy link
Author

scabug commented Sep 29, 2011

Imported From: https://issues.scala-lang.org/browse/SI-5045?orig=1
Reporter: Lanny Ripple (lanny ripple)
Affected Versions: 2.9.1

@scabug
Copy link
Author

scabug commented Sep 29, 2011

Lanny Ripple (lanny ripple) said:
Gotta love Jira. That should read

val Re = """(x)""".r
val Re(x) = "xyz"
val Re(y) = "x"

@scabug
Copy link
Author

scabug commented Sep 29, 2011

Lanny Ripple (lanny ripple) said:
And the other one

".*PTRN.*"

@scabug
Copy link
Author

scabug commented Oct 9, 2011

@dcsobral said:
This was an horrible decision whose time for fixing is long gone. It was long gone even when I first learned of it, years ago.

The best I can think of is make this behavior depend on a flag, defaulting to present behavior, and with a method returning a sensible behavior-flagged regex. Something like:

val Re = """(x)""".r.notAnchored

I suggest you implement it and submit through a pull request on github, then post the link to the pull request here. It is by far the way more likely to produce results.

@scabug
Copy link
Author

scabug commented Oct 11, 2011

Lanny Ripple (lanny ripple) said:
I'm confused why you say the time to fix this is long gone? It's always a good time to fix broken things. Perhaps with a deprecation schedule and lots of visibility to the change but those are doable things if needed.

Also this wouldn't change existing code. A Regex designed to match an entire string because of implicit anchors will match an entire string in the absence of implicit anchors as well. At worst I could see code that was thought to be working but wasn't (testing ftw) start to work. On the other hand this change would bring Scala's matching into line with how other regex engines work (i.e, what you see in the regex is what you get in the match).

All that being said you are the guy with all the experience here so I'll code up the change as you suggest and go forward from there.

@scabug
Copy link
Author

scabug commented Oct 11, 2011

Lanny Ripple (lanny ripple) said:
Github pull request - scala/scala#100

Implemented using a subclass that the super is aware of (re: unanchor()). Not terribly happy with that but didn't want to move Regex to an abstract class. Maybe a RegexLike trait? Feedback appreciated.

@scabug
Copy link
Author

scabug commented Oct 12, 2011

@paulp said:
I could be wrong, but I think he's saying it should have been fixed years ago, not that the statute of limitations on fixing it has run out.

@scabug
Copy link
Author

scabug commented Oct 12, 2011

@dcsobral said:
Ok, let me clarify what I think. First, I have been disgusted by the present behavior since 2009. I might even have opened a ticket about. I consider it wrong, but it is a bit in line with Java behavior. String's "matches" works that way.

I am saying that "fixing" it could break things. For example:

string match {
  case p1() => ...
  case p2() => ...
}

An application might well depend on p1 not doing a partial match, because that's how it has always worked.

If a @migration warning -- which did not exist 2 years ago -- is considered good enough, then, yes, I'd love for it to be truly fixed. People can always adjust their regex by enclosing them in anchors if they want the previous behavior.

The solution I proposed enables people who want "correct" behavior to be able to get it, while not breaking any existing code.

Now, just a bit of review on the pull request, I'm a bit concerned about the use of "lazy val" for pattern. Access to a "val" is much faster than to a "lazy val", and pretty much every method needs a compiled pattern. So, while one might save compiling a pattern twice when creating an "unanchored" version, one looses performance everywhere else.

@scabug
Copy link
Author

scabug commented Oct 12, 2011

Lanny Ripple (lanny ripple) said:
Mostly poor phrasing on my part I think. In the same paragraph I start "Also this wouldn't change existing code." I concede that code might unexpected start to work which is as much an unintended consequence as any bug. I probably should have gone back and rewritten the paragraph then.

Good point on lazy val. One also loses performance on the Regex compile but I missed that lazy would impact all existing code. Admittedly only once and the docs point out that the Regex should be constructed once rather than in a loop but can't count on that being followed. Hmm.

@scabug
Copy link
Author

scabug commented Oct 13, 2011

@dcsobral said (edited on Oct 13, 2011 1:46:40 AM UTC):
Ok, here's an example of code that might stop working:

val emptyLine = """\s*(?:#.*)?""".r
val code = """([^#]*)(?:#.*)?""".r
def stripComments(lines: Seq[String]) = lines flatMap {
    case emptyLine => None
    case code(x)   => Some(x)
}

This will stop working if matches suddenly become unanchored.

As for lazy val, it doesn't have an impact only once. It has an impact every time pattern gets accessed, which is pretty much by any method called on the regex. The first time it goes through synchronization, but the other times it still has to go through a volatile flag check before retrieving the value.

@scabug
Copy link
Author

scabug commented Oct 13, 2011

Lanny Ripple (lanny ripple) said:
Daniel,

I've stated twice now that I'm aware there would be changes in behavior if the implicit anchors were removed. I think we can say we're in violent agreement on that point.

The problem I'm bumping into is construction order. Regex initializes pattern during construction. I don't see a way to make Regex abstract or a trait (really changing existing behavior) which argues that RegexUnAnchored be a subclass which is then returned as the super type with appropriate pattern initialization after construction. My understanding of how to do that is either to use an abstract trait with initialization at the class declaration level [i.e.,

class RegexUA extends { // init here } with Regex with ...

] which is out because Regex as an abstract trait would break

new Regex("(x)")

(and who knows what else) or to use a lazy val to initialize after full class construction. My understanding of lazy is based on the single Stack Overflow question. If it's correct then the initialization of a lazy is based on a volatile but subsequent access is a simple if() clause. That seems pretty light-weight to me given Scala and Java's hint to not recompile Patterns in tight loops but I don't have enough experience in this area to argue the point.

So what am I left with? I have no idea since the two ways documented for getting around order of initialization problems are off the table. I get the feeling I'm frustrating you. It certainly isn't my intent. I recognize and appreciate that you don't have to be offering your advice. I've done development long enough to know that if someone makes suggestions to you when submitting fly-by RFEs you listen to them. (I have also read your blog entries, SO answers, and other scala comments so have some idea that besides offering advice you also know what you are talking about when you do.) However, ... I'll just leave it at however.

I have an internal solution that works. If anyone bumps into this RFE it's simple to subclass Regex with an overridden unapplySeq which does not implicitly anchor. A case class and an implicit def to turn strings into this class using a method (having done perl in a past life I use .qr) is then straightforward.

@scabug
Copy link
Author

scabug commented Oct 18, 2011

@dcsobral said:
I asked for opinions on scala-debate on whether changing the semantics is acceptable. Personally, I'd prefer that as well.

@scabug
Copy link
Author

scabug commented Oct 19, 2011

Lanny Ripple (lanny ripple) said:
Thanks, Daniel.

@scabug
Copy link
Author

scabug commented Jan 25, 2012

@dcsobral said:
The pull request was lost when Scala migrated to github. Could you please remake it?

@scabug
Copy link
Author

scabug commented Jan 25, 2012

Lanny Ripple (lanny ripple) said:
Thanks for the note, Daniel, but after the scala-debate discussions its probably best just to mark this "Will not address".

The Regex code seems to have better documentation now so that will have to be enough.

@scabug
Copy link
Author

scabug commented Jan 25, 2012

Lanny Ripple (lanny ripple) said (edited on Jan 25, 2012 4:08:01 PM UTC):
Ok. And having said that a new way around this problem occurred to me that doesn't change existing behavior. Will work up a patch and add it as a pull request.

scala/scala#137

@scabug
Copy link
Author

scabug commented Mar 12, 2012

@dcsobral said:
I don't know if you saw the recent discussion on scala-language, but you should bring it up the proposed API changes there if you wish to see this incorporated. I know it was discussed before, but the discussion was mostly about people rejecting the previous proposal, instead of accepting the new proposal.

@scabug
Copy link
Author

scabug commented Mar 15, 2012

Lanny Ripple (lanny ripple) said:
Third round of suggested code. This time with tests. scala/scala#273

@scabug
Copy link
Author

scabug commented Mar 15, 2012

Jason Wheeler (bigwheels16) said (edited on Mar 15, 2012 6:29:06 PM UTC):
just so I (and others reading) are clear about what we are talking about, a regex such as

{code}"""(\d+)""".r{code}

will behave as if it was

{code}"""^(\d+)$""".r{code}

with the start-of-line and end-of-line anchors automatically added in the current implementation?

@scabug
Copy link
Author

scabug commented Mar 15, 2012

Lanny Ripple (lanny ripple) said:
Yes, sorry. A lot of this got brought up on scala-debate but I should be clearer here in the jira.

As Regex is currently implemented its unapplySeq uses .match() which attempts to exactly match the regex to the provided string. In effect it adds the start and end anchors to your regex. There are several problems with this. First, you are already supplying a regex and if you wanted anchors presumably you would have supplied them. I have taught six people Scala and every one has tripped over this with, "Why is it not doing what I tell it?", being the most common complaint. Second, the quick-and-dirty way to get around this is to add ".\*" to the start and end of your pattern (e.g., {noformat}""".(\d+).""".r{noformat}) but this can have a rather surprising performance impact in certain situations because of backtracking.

The current pull request adds a trait which can be used to override unapplySeq to use .find(0) instead of .match(). Additionally it adds a .qr method into StringLike so that a Regex with NoImplicitAnchors can be easily generated just like using .r to get a Regex.

Hope that states the problem and fix better.

@scabug
Copy link
Author

scabug commented Mar 15, 2012

@dcsobral said (edited on Mar 15, 2012 7:04:46 PM UTC):
And, just to make it explicit, nothing changes with regards to existing code. The change presently proposed, after discussion on scala-debate, just add a way to get non-anchored behavior.

@scabug
Copy link
Author

scabug commented May 3, 2012

@paulp said:
Let me know if the committed fix is unsatisfactory. No more methods on String though.

@scabug scabug closed this as completed May 3, 2012
@scabug
Copy link
Author

scabug commented May 3, 2012

@dcsobral said:
That's pretty much my original suggestion (I used notAnchored instead of unanchored, but otherwise the same). The method on String was suggested to avoid compiling the pattern twice, as it happens with the committed patch.

@scabug
Copy link
Author

scabug commented May 3, 2012

Lanny Ripple (lanny ripple) said:
Nice, Paul. Much tighter than my patch.

I can understand that everyone must want to decorate String and didn't hold any real belief that .qr would get in there. Easy enough to pimp in locally if needed.

Glad this has been put to bed.

@scabug scabug added this to the 2.10.0 milestone Apr 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants