BigDecimal.hashCode() has high collision rate #6153

Closed
scabug opened this issue Jul 28, 2012 · 9 comments
Comments

scabug commented Jul 28, 2012

Scala's BigDecimal.hashCode(), while not technically broken, is implemented in a way that all but guarantees a large number of collisions. The current implementation looks like this:

override def hashCode(): Int =
    if (isWhole) unifiedPrimitiveHashcode
    else doubleValue.##

unifiedPrimitiveHashcode essentially casts this to an Int and returns the result.
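For reference, here is a minimal standalone paraphrase of what unifiedPrimitiveHashcode appears to do (the function name and exact details below are mine, not the library's; the real method lives in ScalaNumericConversions and may differ):

def unifiedPrimitiveHashcodeSketch(n: BigDecimal): Int = {
  // Paraphrased sketch, not the verbatim library source.
  val lv = n.toLong                                          // truncates anything wider than 64 bits
  if (lv >= Int.MinValue && lv <= Int.MaxValue) lv.toInt     // Int-sized values hash as the Int itself
  else lv.##                                                 // otherwise fall back to Long's hash
}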

There are several problems with this implementation. For the time being, let's ignore whole numbers caught by if(isWhole) and assume we're working with numbers having fractional parts. The fact that the number's Double representation is used to calculate the hash value in this case yields the following undesirable properties:

  • All BigDecimal values greater than Double.MaxValue have equivalent hash values, since they all map to Double.PositiveInfinity
  • All BigDecimal values less than Double.MinValue have equivalent hash values for the same reason
  • All BigDecimal values with absolute value smaller than Double.MinPositiveValue have equivalent hash values, since they all map to 0.0
  • All BigDecimal values with more than 17 significant digits have nearby neighboring values with the same hash value, since all the digits after the first 17 are truncated by the cast to Double

Here is some simple test code that demonstrates the problem:

val doubleEpsilon: BigDecimal = BigDecimal(Double.MinPositiveValue)     //4.9E-324
val subLong: BigDecimal = BigDecimal(Long.MaxValue / 2)
val smallNumbers: Set[Int] = ((1 to 100) map (i => (doubleEpsilon * i / 1000).##)).toSet
val largeNumbers: Set[Int] = ((1 to 100) map (i => (subLong + i + 0.1).##)).toSet
// smallNumbers.size == 1
// largeNumbers.size == 1

BigDecimal values that are whole numbers fare slightly better. Since they are cast to Ints, their lower 32 bits are used as the hash value, which solves the nearby-neighbor collision problem. Unfortunately, because only the low 32 bits are used, it is trivial to generate collisions simply by flipping bits beyond the first 32. This leads to poor hashing behavior for any algorithm that uses such values as bit masks, multiplies or divides by large powers of two, works with Mersenne primes, and so on.
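For example, assuming the truncating behavior sketched above, two whole values that differ only in bits beyond what survives the narrowing conversion should collide (illustrative, untested here):

// Two whole BigDecimals that differ only above the 64th bit; toLong
// discards those bits, so both should hash to 12345 under the old
// implementation (assuming the sketch above).
val base  = BigDecimal(12345)
val other = base + BigDecimal(2).pow(64)
// base.## == other.##   expected to be true with the 2.10-era hashCode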

The following improved hashCode() implementation should solve both of these problems and is relatively simple. Since the hash would be more expensive to compute, it might make sense to cache the result as well. I haven't tested this implementation.

override def hashCode(): Int = {
  if (isWhole && this >= Int.MinValue && this <= Int.MaxValue)
    unifiedPrimitiveHashcode  // keeps the hash consistent with the Int representation
  else {
    // Convert to a fully expanded string (no scientific notation), strip
    // trailing zeros and any trailing '.', so that equal values produce
    // identical strings, then return that string's hash code.
    val plain = bigDecimal.toPlainString
    val normalized =
      if (plain contains ".")
        plain.reverse.dropWhile(_ == '0').dropWhile(_ == '.').reverse
      else plain
    normalized.hashCode
  }
}

scabug commented Jul 28, 2012

Imported From: https://issues.scala-lang.org/browse/SI-6153?orig=1
Reporter: Bennet Huber (bhuber)
Affected Versions: 2.10.0


scabug commented Jul 29, 2012

@paulp said:
I don't see a problem with this implementation, though we should find a way out of the number nurturing business.


scabug commented Jul 29, 2012

@harrah said:
Looks to me like hashCode is broken:

import math.{BigDecimal,BigInt}
val i: BigInt = 1000000
val d: BigDecimal = 1000000
val i4 = i pow 4
val d4 = d pow 4
i4 == d4 // true
i4.hashCode == d4.hashCode // false
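If the unifiedPrimitiveHashcode sketch earlier in the thread is accurate, one plausible reading is that 10^24 is whole but far outside Long range, so BigDecimal's hash goes through a truncating toLong while BigInt hashes the full value, and the two need not agree. Illustrative check (not from the original thread):

val big = BigDecimal(1000000) pow 4   // 1e24: whole, but larger than Long.MaxValue
big.isWhole                           // true  -> takes the whole-number branch
big > BigDecimal(Long.MaxValue)       // true  -> toLong truncates before hashing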


scabug commented Jul 29, 2012

@non said:
Mark: I'm not sure what guarantees Scala expects to make. Reading the library, it seems like "unified hash codes" only apply to integer values between Int.MinValue and Int.MaxValue, so in this case I don't think the hash codes are supposed to be equal. That said, obviously I could imagine different strategies.

Paul: Agreed; in many ways I don't think BigInt/BigDecimal belong in stdlib, but rather in a library that can support them more fully (and/or opt for a better implementation).

Anyway, I think this looks good and I'll try to help get a patch ready.


scabug commented Jul 29, 2012

@paulp said:
Regardless of the range of unified hash codes, within the constraints imposed by reality, if two things are equal they must have the same hashcode. Here is a no-pretense-of-performance attempt to address distribution and correctness issues I wrote a few hours ago.

https://github.com/paulp/scala/tree/issue/6153

I would be thrilled if you took it from here.


scabug commented Jul 29, 2012

@soc said:
I'll need to keep this in mind for the Scala-native BigInt implementation I'm working on.
Are there any guarantees concerning the stability of hashCode values between different Scala versions?


scabug commented Jul 30, 2012

Bennet Huber (bhuber) said:
I've begun working on a patch for this that should solve all of these issues (collisions, consistency between types, and equivalent values of BigDecimal generating the same hashcode). It's mostly done, just needs some fixup and polish. Unfortunately, I won't have time to finish it today, but I hope to submit a pull request by the end of the week.


scabug commented Sep 10, 2013

@Ichoran said:
Any progress on this, or shall I fix it?


scabug commented Dec 29, 2013

@Ichoran said:
There is no sane way to achieve all of the following: (1) a low collision rate; (2) BigInt and BigDecimal hash codes that agree with equality (w.r.t. each other); (3) BigDecimal hashing that can't explode the heap. The problem is that BigInt isn't decimal, so computing the hash code of 1e1000000000 would require churning through the base-2 representation of a billion-digit decimal number. Not fun. Given enough time I could perhaps invent a hashing function with a simplifying form for long runs of zeros, so that powers of ten would have easy-to-compute hash codes as BigInts, but that is way too time-consuming to get right (if it is even possible). Instead, I've applied a simple heuristic: for a small number of digits, convert the BigDecimal to BigInt and hash that; for a large number of digits, do a different calculation on the BigDecimal alone. This is part of scala/scala#3316
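A rough sketch of the kind of heuristic described here; this is not the actual patch from scala/scala#3316, and the helper name and digit threshold below are made up for illustration:

import scala.util.hashing.MurmurHash3

// Hypothetical standalone version of the heuristic; the real change lives
// inside scala.math.BigDecimal itself and caches the computed hash.
def bigDecimalHash(x: BigDecimal, maxDigits: Int = 5000): Int =
  if (x.isWhole && x.precision - x.scale < maxDigits)
    // Few enough digits: hash via BigInt so equal BigInt/BigDecimal values agree.
    x.toBigInt.hashCode
  else {
    // Too many digits: hash the unscaled digits together with the scale,
    // which never forces a huge binary expansion of the value.
    val stripped = x.bigDecimal.stripTrailingZeros
    MurmurHash3.mixLast(stripped.unscaledValue.hashCode, stripped.scale)
  }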

scabug closed this as completed Jan 15, 2014
scabug added this to the 2.11.0-M8 milestone Apr 7, 2017