Bug in split() when delimiters occur at end of string #5069

scabug · 2011-10-10T06:51:47Z

Behavior is unexpected in both string.split(char) and regex.split(string) when a delimiter occurs at the end of a line. Examples:

scala> "foo=bar".split('=')
res26: Array[String] = Array(foo, bar)

scala> "=foo=bar".split('=')
res27: Array[String] = Array("", foo, bar)

scala> "=foo==bar".split('=')
res28: Array[String] = Array("", foo, "", bar)

scala> "==foo==bar".split('=')
res29: Array[String] = Array("", "", foo, "", bar)

scala> "foo=bar=".split('=')
res30: Array[String] = Array(foo, bar)

scala> "foo=bar==".split('=')
res31: Array[String] = Array(foo, bar)

scala> "==foo==bar==".split('=')
res32: Array[String] = Array("", "", foo, "", bar)

The last three results (res30 - res32) are errors, and res32 shows most clearly the problematic results. Delimiters at the beginning lead to an empty string in the results, and multiple delimiters lead to multiple empty strings. Similarly multiple delimiters in the middle lead to empty strings. All of this is correct. However, single or multiple delimiters at the end are simply ignored.

Note that if we were to "define" the behavior of split() with end delimiters to be the correct handling of delimiters at edges, we're then faced with the fact that the behavior at the beginning is incorrect.

In fact, I argue that the actual behavior with leading delimiters and multiple delimiters in the middle is correct, and that trailing delimiters should work the same way. For example, one way to implement a simple database is to put one record per line and use some sort of delimiter to separate them. This is, for example, how /etc/passwd, /etc/group, and similar Unix "databases" work, using : as the delimiter. It's common in such databases for certain fields to be empty, and so if you use split() to split a record into fields, and the last field happens to be empty, or maybe the last few fields happen to be empty, you get too few values returned. In this case if all records have the same number of fields, you can hack around it without too much difficulty, but it gets messier if you have records with variable numbers of fields.

Note that if you actually want the behavior where beginning and/or ending delimiters don't "count", you can easily implement this simply by appropriately filtering out empty strings from the returned array of splits.

I should also note that in Python, split() of a string does the "right thing" as I defined it above.

Exactly the same bug occurs in both string.split(char) and regex.split(string). string.split(char) maps directly to java.lang.String.split(regex:String), and that presumably maps to java.util.regex.Pattern.split(). Presumably Scala's regex.split(string) also maps to java.util.regex.Pattern.split(). Hence the "bug" is in Java.

Note that this bug isn't documented in Java; the documentation for Pattern.split() implies it should work the way I consider correct.

It could be argued that we should simply propagate Java's bug, but I disagree. Scala is its own language with its own libraries. Java just happens to be the platform we use to implement the Scala libraries, but there's no guarantee it always will be -- in fact there's also a Scala for .NET being produced. Should string split on Scala for .NET be forced to implement bugs in Java's libraries? Or should we just expose the .NET string split functions directly, even though there will likely be different behavior since it probably won't be broken in exactly the way that Java's string split is, and hence code cannot be ported from one to the other? Neither seems a good idea.

Instead I think we should "do it right" even if it means that we have to write our own string-splitting code.

The only other issue is whether (1) to simply fix the behavior of split() or (2) to deprecate it and create a new split function that works properly. Option (2) is better for strict compatibility but I suggest Option (1), since the buggy behavior of split() isn't actually documented anywhere and in general Scala has made other changes that break code when switching to new releases.

ben

scabug · 2011-10-10T06:51:47Z

Imported From: https://issues.scala-lang.org/browse/SI-5069?orig=1
Reporter: Ben Wing (benwing)
Affected Versions: 2.9.1

scabug · 2011-10-10T07:08:51Z

@paulp said:

Note that this bug isn't documented in Java; the documentation for Pattern.split() implies it should work the way I consider correct.

The behavior is documented both at String#split and Pattern#split. I'm not sure what you are reading.

String: "Trailing empty strings are therefore not included in the resulting array."

Pattern: "If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded."

Sometimes you have to live with imperfections. The benefit to deviating from java's behavior has to be quite marked to compensate for the fact that it is different. "Fixing" this fails that test.

scabug · 2011-10-16T09:01:36Z

Ben Wing (benwing) said:
Sorry, I didn't read the third paragraph closely enough. In fact, I see now that if I set the limit to -1, I get the behavior I want.

scabug closed this as completed Oct 10, 2011

scabug added the won't fix label Apr 7, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in split() when delimiters occur at end of string #5069

Bug in split() when delimiters occur at end of string #5069

scabug commented Oct 10, 2011

scabug commented Oct 10, 2011

scabug commented Oct 10, 2011

scabug commented Oct 16, 2011

Bug in split() when delimiters occur at end of string #5069

Bug in split() when delimiters occur at end of string #5069

Comments

scabug commented Oct 10, 2011

scabug commented Oct 10, 2011

scabug commented Oct 10, 2011

scabug commented Oct 16, 2011