Handling of strings with a NUL byte #9915

Closed
scabug opened this issue Sep 7, 2016 · 8 comments

scabug commented Sep 7, 2016

Scala appears to encode strings containing a NUL character ("\u0000") differently than Java. To reproduce this, define a compile-time constant in Java.

Create a Java file (Test.java):

public class Test {
  public static final String TEST = "\0ABC";
}

Create a sample Scala main program and access TEST from there.

When the string TEST is accessed from Scala, the NUL character appears to be encoded with 2 bytes. So even simple equality tests like

"\0ABC".equals(Test.TEST) 

fail and return false.

However, when TEST is made a private field of class Test and returned from a static getter, getTestValue(), the equality check

"\0ABC".equals(Test.getTestValue())

passes and returns true.

I took a look at the generated bytecode, and I suspect that defining TEST as a compile-time constant (as opposed to accessing it via a getter) is what makes the difference.

Can someone please explain if I'm missing something obvious related to encoding of Strings? Some insight on the problem will be helpful.


scabug commented Sep 7, 2016

Imported From: https://issues.scala-lang.org/browse/SI-9915?orig=1
Reporter: jagadish (jagadish1989)
Affected Versions: 2.11.0


scabug commented Sep 7, 2016

@som-snytt said (edited on Sep 7, 2016 7:41:00 PM UTC):
The modified UTF-8 encoding isn't handled (JVMS 4.4.7):

There are two differences between this format and the "standard" UTF-8 format.
First, the null character (char)0 is encoded using the 2-byte format rather than the
1-byte format, so that modified UTF-8 strings never have embedded nulls. Second,
only the 1-byte, 2-byte, and 3-byte formats of standard UTF-8 are used. The Java
Virtual Machine does not recognize the four-byte format of standard UTF-8; it uses
its own two-times-three-byte format instead.
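
The two differences quoted above can be observed from plain Java. The following is a minimal sketch, not code from scalac: it assumes `DataOutputStream.writeUTF` as a convenient way to produce modified UTF-8 (it prepends a 2-byte length, which the helper drops), and the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Encode a string as modified UTF-8 (hypothetical helper).
    // writeUTF emits a 2-byte length followed by the encoded bytes; we strip the length.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = bytes.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Render bytes as space-separated lowercase hex.
    static String hex(byte[] bs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bs) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\0ABC";
        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8))); // 00 41 42 43
        System.out.println(hex(modifiedUtf8(nul)));                    // c0 80 41 42 43

        // A supplementary character (U+1F600) is one 4-byte sequence in standard UTF-8,
        // but two 3-byte sequences (one per surrogate) in modified UTF-8.
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length); // 4
        System.out.println(modifiedUtf8(supplementary).length);                    // 6
    }
}
```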

So NUL and supplementary characters in constants come out as U+FFFD, the replacement character.

scala> val x = new str.Test
x: str.Test = str.Test@c808207

scala> x.str
res0: String = imagine JIRA allowed posting unicode comments instead of reporting "communications failure"

scala> str.Test.STR
res1: String = ����������������
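
The symptom can be imitated outside the compiler. This sketch assumes (it is an illustration, not scalac's actual code path) that the constant's modified UTF-8 bytes are handed to a standard UTF-8 decoder; the class and method names are hypothetical. Java's decoder rejects the overlong `c0 80` sequence and substitutes U+FFFD, so the result no longer equals the original string.

```java
import java.nio.charset.StandardCharsets;

public class WrongDecoderDemo {
    // Decode bytes with the standard UTF-8 decoder, which replaces malformed input with U+FFFD.
    static String decodeAsStandardUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\0ABC" as stored in a class file: NUL is the 2-byte sequence c0 80.
        byte[] classFileBytes = {(byte) 0xC0, (byte) 0x80, 'A', 'B', 'C'};
        String decoded = decodeAsStandardUtf8(classFileBytes);

        System.out.println(decoded.equals("\0ABC"));        // false
        System.out.println(decoded.indexOf('\uFFFD') >= 0); // true
        System.out.println(decoded.endsWith("ABC"));        // true
    }
}
```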


scabug commented Sep 8, 2016

jagadish (jagadish1989) said (edited on Sep 8, 2016 4:04:39 AM UTC):
I understand the difference between UTF-8 and modified UTF-8, but I'm not clear where exactly the mismatch happens in this scenario (or where the bug, if any, would be). Does scalac not support the modified UTF-8 encoding? I assume this is a compile-time constant. If so, do scalac and javac generate different copies of the same constant? (Sorry if I'm missing something; I'm no expert on compiler internals.)


scabug commented Sep 8, 2016

@som-snytt said:
scala/scala#5384


scabug commented Sep 8, 2016

@som-snytt said:
Thanks, you got it exactly right. This was a learning experience for me, and thanks to your info, not entirely fruitless.

Unfortunately, there is no JIRA label for not-entirely-fruitless. It was fun adding "java-interop", which is only a slight stretch.

The problem was that the constants were ingested incorrectly from the other class file. Now the copies no longer differ, modulo separate recompilation.


scabug commented Sep 8, 2016

jagadish (jagadish1989) said (edited on Sep 8, 2016 3:42:36 PM UTC):
That seems to align with my initial suspicion (in the previous comment).

Thanks for the pull request. It would be super helpful if you added a comment in the code saying that "Java class file constants are encoded using modified UTF-8", with a reference to section 4.4.7 of the JVM spec.

That way the reader can see that the reason for calling newTermName(fromMUTF8(in.buf, start, len + 2)) is that the first byte is the tag (which in this case is CONSTANT_Utf8), the second byte represents the length of the class-file constant, and the actual constant starts from the third byte.


scabug commented Sep 8, 2016

@som-snytt said:
You can comment on the PR itself; but the 2 bytes are the length, and start is the index of the length. firstExpecting already advanced past the first byte. I guess I'm not super-helpful.
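
The layout described here (one tag byte, then a 2-byte length, then the modified UTF-8 bytes) can be sketched in Java. This is an illustration under stated assumptions, not the code from the PR: `readUtf8Entry` and the class name are hypothetical, and `DataInputStream.readUTF` stands in for a modified UTF-8 decoder. Because `readUTF` consumes the 2-byte length itself, it is given `len + 2` bytes beginning at the index of the length.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class Utf8EntryDemo {
    // Tag value of a CONSTANT_Utf8 constant pool entry (JVMS 4.4).
    static final int CONSTANT_UTF8_TAG = 1;

    // Read one CONSTANT_Utf8 entry: u1 tag, u2 length, then `length` bytes of modified UTF-8.
    static String readUtf8Entry(byte[] buf, int tagIndex) {
        if (buf[tagIndex] != CONSTANT_UTF8_TAG) {
            throw new IllegalArgumentException("not a CONSTANT_Utf8 entry");
        }
        int start = tagIndex + 1; // index of the 2-byte length, just past the tag
        int len = ((buf[start] & 0xff) << 8) | (buf[start + 1] & 0xff);
        // readUTF expects the 2-byte length prefix, so pass len + 2 bytes from `start`.
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buf, start, len + 2))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Entry for "\0ABC": tag 01, length 00 05, then c0 80 41 42 43.
        byte[] entry = {1, 0, 5, (byte) 0xC0, (byte) 0x80, 'A', 'B', 'C'};
        System.out.println(readUtf8Entry(entry, 0).equals("\0ABC")); // true
    }
}
```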


scabug commented Sep 8, 2016

jagadish (jagadish1989) said:
Thanks for the pointer! I'll take this discussion to the PR.
