Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect, inconsistent regex handling of $ in input string #10198

Closed
scabug opened this issue Feb 18, 2017 · 3 comments
Closed

Incorrect, inconsistent regex handling of $ in input string #10198

scabug opened this issue Feb 18, 2017 · 3 comments

Comments

@scabug
Copy link

scabug commented Feb 18, 2017

scala.util.matching.Regex.replaceAllIn has some problems in handling "$" (the dollar sign) in the input string. In some cases, it treats $3 (for example) in the input as a capture group replacement / back-reference. To do so is incorrect.

I'll walk you through some examples, starting with examples that work. Also, please notice the variation in behavior.

I'll start with a pattern I used in some NLP processing to strip 's from possessive tokens.

val pattern = """([^']+)'s""".r
pattern: scala.util.matching.Regex = ([^']+)'s

scala> pattern.replaceAllIn("Bob's", m => m.group(1) )
res0: String = Bob

scala> pattern.replaceAllIn("Frank", m => m.group(1) )
res1: String = Frank

scala> pattern.replaceAllIn("$300", m => m.group(1) )
res2: String = $300

scala> pattern.replaceAllIn("$300's", m => m.group(1) )
java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.start(Matcher.java:375)
at java.util.regex.Matcher.appendReplacement(Matcher.java:880)
at scala.util.matching.Regex$Replacement.replace(Regex.scala:861)
at scala.util.matching.Regex$Replacement.replace$(Regex.scala:861)
at scala.util.matching.Regex$MatchIterator$$anon$1.replace(Regex.scala:839)
at scala.util.matching.Regex.$anonfun$replaceAllIn$1(Regex.scala:491)
at scala.collection.Iterator.foreach(Iterator.scala:929)
at scala.collection.Iterator.foreach$(Iterator.scala:929)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1406)
at scala.util.matching.Regex.replaceAllIn(Regex.scala:491)
... 29 elided

Above, please note that the first argument to pattern.replaceAllIn is an input string. There should be no handling of capture groups (e.g. $2) in an input string.

scala> pattern.replaceAllIn("$x300's", m => m.group(1) )
java.lang.IllegalArgumentException: Illegal group reference
at java.util.regex.Matcher.appendReplacement(Matcher.java:857)
at scala.util.matching.Regex$Replacement.replace(Regex.scala:861)
at scala.util.matching.Regex$Replacement.replace$(Regex.scala:861)
at scala.util.matching.Regex$MatchIterator$$anon$1.replace(Regex.scala:839)
at scala.util.matching.Regex.$anonfun$replaceAllIn$1(Regex.scala:491)
at scala.collection.Iterator.foreach(Iterator.scala:929)
at scala.collection.Iterator.foreach$(Iterator.scala:929)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1406)
at scala.util.matching.Regex.replaceAllIn(Regex.scala:491)
... 29 elided

scala> pattern.replaceAllIn("$600's", m => m.group(1) )
java.lang.IndexOutOfBoundsException: No group 6
at java.util.regex.Matcher.start(Matcher.java:375)
at java.util.regex.Matcher.appendReplacement(Matcher.java:880)
at scala.util.matching.Regex$Replacement.replace(Regex.scala:861)
at scala.util.matching.Regex$Replacement.replace$(Regex.scala:861)
at scala.util.matching.Regex$MatchIterator$$anon$1.replace(Regex.scala:839)
at scala.util.matching.Regex.$anonfun$replaceAllIn$1(Regex.scala:491)
at scala.collection.Iterator.foreach(Iterator.scala:929)
at scala.collection.Iterator.foreach$(Iterator.scala:929)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1406)
at scala.util.matching.Regex.replaceAllIn(Regex.scala:491)
... 29 elided

Note the variation of behavior here (no error)

scala> pattern.replaceAllIn("$1200's", m => m.group(1) )
res7: String = $1200200

@scabug
Copy link
Author

scabug commented Feb 18, 2017

Imported From: https://issues.scala-lang.org/browse/SI-10198?orig=1
Reporter: David James (dj49)
Affected Versions: 2.12.1

@scabug
Copy link
Author

scabug commented Feb 18, 2017

David James (dj49) said:
I see now that Regex.quoteReplacement can be used, as follows:

"""([^']+)'s""".r.replaceAllIn(token, m => Regex.quoteReplacement(m.group(1)))

However, this seems like a lot of work to do a common thing (replace based on a capture group).

The other situation, where someone wants to replace with a "$3" and then have that trigger a back-reference substitution, seems much rarer.

I realize this is getting a little ranty. It was painful. Anyhow, I don't want to break backwards compatibility, but this does break the principle of least surprise.

@scabug
Copy link
Author

scabug commented Feb 18, 2017

@som-snytt said (edited on Feb 18, 2017 9:37:00 PM UTC):
I feel your pain, but it's clearly documented in the scaladoc and is just the way Java does it.

In my browser, I have to click on the method to show the explanation: $n is a group, so use quoteReplacement. Or in this example, use _ => "$1". The use case for the group method here would be to conditionally construct the replacement string.

This why we no longer construct APIs around strings. I think they're going to spin scala.util out into its own module, so maybe folks will devise a better scala.util.matching, working from accumulated knowledge.

Also, a linter could warn about unescaped $1 in a replacement string.

@scabug scabug closed this as completed Feb 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant