Handling of strings with a NUL byte #9915

scabug · 2016-09-07T15:02:42Z

Scala appears to encode strings with a NUL byte "\u0000" differently than Java. To reproduce this, define a compile-time constant in Java.

Create a Java file (Test.java):

public class Test {
  public static final String TEST = "\0ABC";
}

Create a sample Scala main program and access TEST from there.

When accessing the string TEST from Scala, the NUL byte appears to be encoded with 2 bytes. So even simple equality tests like

"\0ABC".equals(Test.TEST)

fail and return false.

However, when TEST is made a private field and returned from class Test through a static getter, getTestValue(), the equality check "\0ABC".equals(Test.getTestValue()) passes and returns true.

I took a look at the generated bytecode, and I suspect that defining TEST as a compile-time constant is what makes the difference (as opposed to accessing it via a getter).

Can someone please explain if I'm missing something obvious related to the encoding of Strings? Some insight into the problem would be helpful.

scabug · 2016-09-07T15:02:42Z

Imported From: https://issues.scala-lang.org/browse/SI-9915?orig=1
Reporter: jagadish (jagadish1989)
Affected Versions: 2.11.0

scabug · 2016-09-07T19:39:49Z

@som-snytt said (edited on Sep 7, 2016 7:41:00 PM UTC):
The modified encoding isn't handled (jvms 4.4.7):

There are two differences between this format and the "standard" UTF-8 format.
First, the null character (char)0 is encoded using the 2-byte format rather than the
1-byte format, so that modified UTF-8 strings never have embedded nulls. Second,
only the 1-byte, 2-byte, and 3-byte formats of standard UTF-8 are used. The Java
Virtual Machine does not recognize the four-byte format of standard UTF-8; it uses
its own two-times-three-byte format instead.

So, 0 and supplementary chars in constants come out as U+FFFD.

scala> val x = new str.Test
x: str.Test = str.Test@c808207

scala> x.str
res0: String = imagine JIRA allowed posting unicode comments instead of reporting "communications failure"

scala> str.Test.STR
res1: String = ����������������
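To make the quoted difference concrete, here is a minimal Java sketch that is not taken from scalac. It relies on DataOutputStream.writeUTF, which emits a 2-byte length followed by modified UTF-8, to show how "\0ABC" is laid out in each encoding and what happens when the modified form is read with a standard UTF-8 decoder. The class and helper names (ModifiedUtf8Demo, modifiedUtf8, hex) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Encodes s as modified UTF-8 (JVMS 4.4.7) and drops the 2-byte length
    // prefix that DataOutputStream.writeUTF puts in front of the payload.
    public static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Renders bytes as space-separated lowercase hex, e.g. "c0 80".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\0ABC";
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);
        byte[] modified = modifiedUtf8(s);

        System.out.println("standard UTF-8: " + hex(standard)); // 00 41 42 43
        System.out.println("modified UTF-8: " + hex(modified)); // c0 80 41 42 43

        // Decoding modified UTF-8 as if it were standard UTF-8 turns the
        // 2-byte NUL into replacement characters, so equality fails.
        String misread = new String(modified, StandardCharsets.UTF_8);
        System.out.println("equal after misreading: " + s.equals(misread)); // false
    }
}
```

The bytes c0 80 are not valid standard UTF-8, which is why a reader that ignores the modified encoding ends up with U+FFFD instead of the NUL character.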

scabug · 2016-09-08T03:51:50Z

jagadish (jagadish1989) said (edited on Sep 8, 2016 4:04:39 AM UTC):
I understand the difference between UTF-8 and modified UTF-8. But I'm not clear where exactly the mismatch happens in this scenario (or where the bug, if any, should be). Does scalac not support the modified UTF-8 encoding? I assume this is a compile-time constant. If so, do scalac and javac generate different copies of the same constant? (Sorry if I'm missing something, I'm no expert on compiler internals.)

scabug · 2016-09-08T05:06:46Z

@som-snytt said:
scala/scala#5384

scabug · 2016-09-08T05:30:05Z

@som-snytt said:
Thanks, you got it exactly right. This was a learning experience for me, and thanks to your info, not entirely fruitless.

Unfortunately, there is no JIRA label for not-entirely-fruitless. It was fun adding "java-interop", which is only a slight stretch.

The problem was that the constants were ingested incorrectly from the other class file. Now the copies no longer differ, modulo separate recompilation.

scabug · 2016-09-08T15:41:54Z

jagadish (jagadish1989) said (edited on Sep 8, 2016 3:42:36 PM UTC):
That seems to align with my initial suspicion (in the previous comment).

Thanks for the pull request. It would be super-helpful if you added a comment in the code along the lines of: "Java class file constants are encoded using modified UTF-8; see section 4.4.7 of the JVM spec."

That way the reader understands that the reason for doing newTermName(fromMUTF8(in.buf, start, len + 2)) is that the first byte is the tag (in this case CONSTANT_Utf8), the second byte represents the length of the class-file constant, and the actual constant starts at the third byte.

scabug · 2016-09-08T16:09:29Z

@som-snytt said:
You can comment on the PR itself; but the 2 bytes are the length, and start is the index of the length. firstExpecting already advanced past the first byte. I guess I'm not super-helpful.
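The layout described here (one tag byte, then a 2-byte length, then the modified UTF-8 payload) can be sketched in Java as follows. This is an illustration only and not the code from the PR: Utf8EntryDemo and readUtf8Entry are hypothetical names, and DataInputStream.readUTF stands in for the decoding step because it also expects a 2-byte length followed by modified UTF-8, which is why it is handed len + 2 bytes beginning at the index of the length.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class Utf8EntryDemo {
    // Tag value of a CONSTANT_Utf8_info constant pool entry (JVMS 4.4.7).
    static final int CONSTANT_UTF8_TAG = 1;

    // Decodes the CONSTANT_Utf8_info entry whose tag byte sits at tagIndex.
    public static String readUtf8Entry(byte[] buf, int tagIndex) {
        if (buf[tagIndex] != CONSTANT_UTF8_TAG) {
            throw new IllegalArgumentException("not a CONSTANT_Utf8 entry");
        }
        int start = tagIndex + 1; // index of the 2-byte length, just past the tag
        int len = ((buf[start] & 0xff) << 8) | (buf[start + 1] & 0xff);
        // readUTF consumes the 2-byte length itself, so the slice covers
        // the length field plus the payload: len + 2 bytes from start.
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buf, start, len + 2))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hand-built entry for "\0ABC": tag 1, length 5, then c0 80 'A' 'B' 'C'.
        byte[] entry = {1, 0, 5, (byte) 0xc0, (byte) 0x80, 'A', 'B', 'C'};
        String s = readUtf8Entry(entry, 0);
        System.out.println(s.equals("\0ABC")); // true
        System.out.println(s.length());        // 4
    }
}
```

Note that the length field counts bytes, not characters: the entry above has a length of 5 but decodes to a 4-character string.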

scabug · 2016-09-08T16:39:49Z

jagadish (jagadish1989) said:
Thanks for the pointer! I'll take this discussion in the PR.

scabug closed this as completed Nov 29, 2016

scabug added the help wanted, has PR, java interop, quickfix, and minimized labels Apr 7, 2017

scabug added this to the 2.12.1 milestone Apr 7, 2017

scabug assigned som-snytt Apr 7, 2017
