Scala Programming Language
  SI-6153

BigDecimal.hashCode() has high collision rate

    Details

      Description

      Scala's BigDecimal.hashCode(), while not technically broken, is implemented in such a way as to nearly guarantee a large number of collisions. The current implementation looks like this:

      override def hashCode(): Int =
          if (isWhole) unifiedPrimitiveHashcode
          else doubleValue.##
      

      unifiedPrimitiveHashcode essentially casts this to an Int and returns the result.
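
      For reference, here is a rough, standalone paraphrase of what that helper does for a BigDecimal value (reconstructed from memory of the ScalaNumericAnyConversions sources, so treat the details as an assumption rather than a quote; the function name is invented here):

      //Rough paraphrase, not the verbatim library code: small whole values hash
      //like the corresponding Int; everything else falls back to Long hashing of
      //the (possibly truncated) toLong value.
      def unifiedPrimitiveHashOf(x: BigDecimal): Int = {
          val lv = x.toLong
          if (lv >= Int.MinValue && lv <= Int.MaxValue) lv.toInt
          else lv.##
      }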

      There are several problems with this implementation. For the time being, let's ignore whole numbers caught by if(isWhole) and assume we're working with numbers having fractional parts. The fact that the number's Double representation is used to calculate the hash value in this case yields the following undesirable properties:

      • All BigDecimal values greater than Double.MaxValue have equivalent hash values, since they all map to Double.PositiveInfinity
      • All BigDecimal values less than Double.MinValue have equivalent hash values for the same reason
      • All BigDecimal values with absolute value smaller than Double.MinPositiveValue have equivalent hash values, since they all map to 0.0
      • All BigDecimal values with more than 17 significant digits have nearby neighboring values with the same hash value, since all the digits after the first 17 are truncated by the cast to Double

      Here is some simple test code that demonstrates the problem:

      val doubleEpsilon: BigDecimal = BigDecimal(Double.MinPositiveValue)     //4.9E-324
      val subLong: BigDecimal = BigDecimal(Long.MaxValue / 2)
      val smallNumbers: Set[Int] = ((1 to 100) map (i => (doubleEpsilon * i / 1000).##)).toSet
      val largeNumbers: Set[Int] = ((1 to 100) map (i => (subLong + i + 0.1).##)).toSet
      //smallNumbers.size == 1
      //largeNumbers.size == 1
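
      The first bullet above (values beyond Double.MaxValue) can be exercised the same way. This check is not part of the original report, but it follows the same pattern and assumes the pre-fix implementation shown earlier:

      val aboveDouble: Set[Int] =
          ((1 to 100) map (i => (BigDecimal(Double.MaxValue) * 2 + i + 0.1).##)).toSet
      //aboveDouble.size == 1 on the pre-fix implementation, since every such value maps to Double.PositiveInfinity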
      

      BigDecimal values that are whole numbers fare slightly better. Since they are cast to Ints, their lower 32 bits are used as the hash value, solving the nearby neighbor collision problem. Unfortunately, since only the first 32 bits are used, it is trivial to generate collisions between them simply by flipping bits beyond the first 32. This will lead to poor performance for any algorithms using them as masks, multiplying or dividing by large powers of two, working with Mersenne primes, etc.

      The following improved hashCode() implementation, presented as pseudo-code, should solve both of these problems and is relatively simple. Since hashCode would be more expensive to calculate, it might make sense to store the result as well. I haven't tested this implementation.

      override def hashCode(): Int = {
          if (isWhole && this < Int.MaxValue && this > Int.MinValue)
              unifiedPrimitiveHashcode
              //This makes the hashCode implementation consistent with Int representations
          else {
              //convert to string (fully expanded, no scientific notation)
              //if string contains ".":
              //  strip trailing 0s
              //  strip trailing .
              //
              //return string's hashcode
          }
      }
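
      A rough, untested translation of that pseudo-code into standalone Scala might look like the following (the function name is invented here, and this is a sketch of the proposal, not of any eventual patch):

      //Untested sketch of the proposal above, written as a standalone function
      def proposedHash(x: BigDecimal): Int =
          if (x.isWhole && x < Int.MaxValue && x > Int.MinValue)
              x.toInt  //consistent with the hash of the equal Int value
          else {
              //fully expanded decimal string, no scientific notation
              val expanded = x.bigDecimal.toPlainString
              //strip trailing 0s and a trailing "." so equal values render identically
              val normalized =
                  if (expanded.contains("."))
                      expanded.reverse.dropWhile(_ == '0').dropWhile(_ == '.').reverse
                  else expanded
              normalized.hashCode  //String hashing mixes in every remaining digit
          }

      For example, proposedHash(BigDecimal("1.10")) and proposedHash(BigDecimal("1.1")) agree, since both normalize to the string "1.1".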
      

        Activity

        Paul Phillips added a comment -

        I don't see a problem with this implementation, though we should find a way out of the number nurturing business.

        Mark Harrah added a comment -

        Looks to me like hashCode is broken:

        import math.{BigDecimal,BigInt}
        val i: BigInt = 1000000
        val d: BigDecimal = 1000000
        val i4 = i pow 4
        val d4 = d pow 4
        i4 == d4 // true
        i4.hashCode == d4.hashCode // false
        
        Erik Osheim added a comment -

        Mark: I'm not sure what guarantees Scala expects to make. Reading the library, it seems like "unified hash codes" only apply to integer values between Int.MinValue and Int.MaxValue, so in this case I don't think the hash codes are supposed to be equal. That said, obviously I could imagine different strategies.

        Paul: Agreed; in many ways I don't think BigInt/BigDecimal belong in the stdlib, but rather in a library that can support them more fully (and/or opt for a better implementation).

        Anyway, I think this looks good and I'll try to help get a patch ready.

        Paul Phillips added a comment -

        Regardless of the range of unified hash codes, within the constraints imposed by reality, if two things are equal they must have the same hashCode. Here is a no-pretense-of-performance attempt, written a few hours ago, to address the distribution and correctness issues.

        https://github.com/paulp/scala/tree/issue/6153

        I would be thrilled if you took it from here.

        Simon Ochsenreither added a comment -

        I'll need to keep this in mind for the Scala-native BigInt implementation I'm working on.
        Are there any guarantees concerning the stability of hashCode values between different Scala versions?

        Bennet Huber added a comment -

        I've begun working on a patch for this that should solve all of these issues (collisions, consistency between types, and equivalent values of BigDecimal generating the same hashcode). It's mostly done, just needs some fixup and polish. Unfortunately, I won't have time to finish it today, but I hope to submit a pull request by the end of the week.

        Rex Kerr added a comment -

        Any progress on this, or shall I fix it?

        Rex Kerr added a comment -

        There is no sane way to achieve all of the following: (1) a low collision rate; (2) BigInt and BigDecimal hash codes that agree with equality (w.r.t. each other); (3) a BigDecimal hashCode that can't explode the heap. The problem is that BigInt isn't decimal, so computing the hash code of 1e1000000000 would require churning through the base-2 representation of a billion-digit decimal number. Not fun. Given enough time, I could perhaps invent a hashing function with a simplifying form for long runs of zeros, so that powers of ten of BigInt would have easy-to-compute hash codes. But this is way too time-consuming to get right (if it is even possible). Instead, I've applied a simple heuristic: values with a small number of digits convert the BigDecimal to a BigInt; values with a large number of digits use a different calculation on the BigDecimal alone. This is part of https://github.com/scala/scala/pull/3316
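
        Purely as an illustration of the shape of that heuristic (the digit threshold, helper name, and wide-value hash below are invented for this sketch and are not taken from the linked pull request):

        import scala.math.BigDecimal

        //Illustration only; the real cutoff and calculations in the patch differ
        val smallDigitLimit = 4096  //invented threshold for "a small number of digits"

        def heuristicHash(x: BigDecimal): Int = {
          val wholeDigits = x.precision.toLong - x.scale  //approximate count of digits left of the point
          if (x.isWhole && wholeDigits < smallDigitLimit)
            x.toBigInt.hashCode  //modest whole values: agree with the equal BigInt's hash
          else
            x.bigDecimal.stripTrailingZeros.hashCode  //wide or fractional values: hash a normalized form of the BigDecimal itself
        }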


          People

          • Assignee: Rex Kerr
          • Reporter: Bennet Huber
          • Votes: 0
          • Watchers: 8
