New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BigDecimal.hashCode() has high collision rate #6153
Comments
Imported From: https://issues.scala-lang.org/browse/SI-6153?orig=1 |
@paulp said: |
@harrah said: import math.{BigDecimal,BigInt}
val i: BigInt = 1000000
val d: BigDecimal = 1000000
val i4 = i pow 4
val d4 = d pow 4
i4 == d4 // true
i4.hashCode == d4.hashCode // false |
@non said: Paul: Agreed; in many ways I don't think BigInt/BigDecimal belong in stdlib, but rather in a library that can support them more fully (and or opt for a better implementation). Anyway, I think this looks good and I'll try to help get a patch ready. |
@paulp said: https://github.com/paulp/scala/tree/issue/6153 I would be thrilled if you took it from here. |
@soc said: |
Bennet Huber (bhuber) said: |
@Ichoran said: |
@Ichoran said: |
Scala's BigDecimal.hashCode() implementation, while not technically broken, is implemented in such a way as to nearly guarantee a large number of collisions. The current implementation looks like this:
unifiedPrimitiveHashcode essentially casts this to an Int and returns the result.
There are several problems with this implementation. For the time being, let's ignore whole numbers caught by if(isWhole) and assume we're working with numbers having fractional parts. The fact that the number's Double representation is used to calculate the hash value in this case yields the following undesirable properties:
Here is some simple test code that demonstrates the problem:
BigDecimal values that are whole numbers fare slightly better. Since they are cast to Ints, their lower 32 bits are used as the hash value, solving the nearby neighbor collision problem. Unfortunately, since only the first 32 bits are used, it is trivial to generate collisions between them simply by flipping bits beyond the first 32. This will lead to poor performance for any algorithms using them as masks, multiplying or dividing by large powers of two, working with Mersenne primes, etc.
The following improved hashCode() implementation, presented as pseudo-code, should solve both of these problems and is relatively simple. Since hashCode would be more expensive to calculate, it might make sense to store the result as well. I haven't tested this implementation.
The text was updated successfully, but these errors were encountered: