Scala Programming Language
  SI-6153

BigDecimal.hashCode() has high collision rate

    Details

      Description

      Scala's BigDecimal.hashCode(), while not technically broken, is implemented in such a way as to nearly guarantee a large number of collisions. The current implementation looks like this:

      override def hashCode(): Int =
          if (isWhole) unifiedPrimitiveHashcode
          else doubleValue.##
      

      unifiedPrimitiveHashcode essentially casts this to an Int and returns the result.
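
      For reference, here is a rough, standalone paraphrase of what that helper does for a BigDecimal value (reconstructed from memory of the ScalaNumericAnyConversions sources, so treat the details as an assumption rather than a quote; the function name is invented here):

      //Rough paraphrase, not the verbatim library code: small whole values hash
      //like the corresponding Int; everything else falls back to Long hashing of
      //the (possibly truncated) toLong value.
      def unifiedPrimitiveHashOf(x: BigDecimal): Int = {
          val lv = x.toLong
          if (lv >= Int.MinValue && lv <= Int.MaxValue) lv.toInt
          else lv.##
      }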

      There are several problems with this implementation. For the time being, let's ignore whole numbers caught by if(isWhole) and assume we're working with numbers having fractional parts. The fact that the number's Double representation is used to calculate the hash value in this case yields the following undesirable properties:

      • All BigDecimal values greater than Double.MaxValue have equivalent hash values, since they all map to Double.PositiveInfinity
      • All BigDecimal values less than Double.MinValue have equivalent hash values for the same reason
      • All BigDecimal values with absolute value smaller than Double.MinPositiveValue have equivalent hash values, since they all map to 0.0
      • All BigDecimal values with more than 17 significant digits have nearby neighboring values with the same hash value, since all the digits after the first 17 are truncated by the cast to Double

      Here is some simple test code that demonstrates the problem:

      val doubleEpsilon: BigDecimal = BigDecimal(Double.MinPositiveValue)     //4.9E-324
      val subLong: BigDecimal = BigDecimal(Long.MaxValue / 2)
      val smallNumbers: Set[Int] = ((1 to 100) map (i => (doubleEpsilon * i / 1000).##)).toSet
      val largeNumbers: Set[Int] = ((1 to 100) map (i => (subLong + i + 0.1).##)).toSet
      //smallNumbers.size == 1
      //largeNumbers.size == 1
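
      The first bullet above (values beyond Double.MaxValue) can be exercised the same way. This check is not part of the original report, but it follows the same pattern and assumes the pre-fix implementation shown earlier:

      val aboveDouble: Set[Int] =
          ((1 to 100) map (i => (BigDecimal(Double.MaxValue) * 2 + i + 0.1).##)).toSet
      //aboveDouble.size == 1 on the pre-fix implementation, since every such value maps to Double.PositiveInfinity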
      

      BigDecimal values that are whole numbers fare slightly better. Since they are cast to Ints, their lower 32 bits are used as the hash value, solving the nearby neighbor collision problem. Unfortunately, since only the first 32 bits are used, it is trivial to generate collisions between them simply by flipping bits beyond the first 32. This will lead to poor performance for any algorithms using them as masks, multiplying or dividing by large powers of two, working with Mersenne primes, etc.

      The following improved hashCode() implementation, presented as pseudo-code, should solve both of these problems and is relatively simple. Since hashCode would be more expensive to calculate, it might make sense to store the result as well. I haven't tested this implementation.

      override def hashCode(): Int = {
          if (isWhole && this < Int.MaxValue && this > Int.MinValue)
              unifiedPrimitiveHashcode
              //This makes the hashCode implementation consistent with Int representations
          else {
              //convert to string (fully expanded, no scientific notation)
              //if string contains ".":
              //  strip trailing 0s
              //  strip trailing .
              //
              //return string's hashcode
          }
      }
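
      A rough, untested translation of that pseudo-code into standalone Scala might look like the following (the function name is invented here, and this is a sketch of the proposal, not of any eventual patch):

      //Untested sketch of the proposal above, written as a standalone function
      def proposedHash(x: BigDecimal): Int =
          if (x.isWhole && x < Int.MaxValue && x > Int.MinValue)
              x.toInt  //consistent with the hash of the equal Int value
          else {
              //fully expanded decimal string, no scientific notation
              val expanded = x.bigDecimal.toPlainString
              //strip trailing 0s and a trailing "." so equal values render identically
              val normalized =
                  if (expanded.contains("."))
                      expanded.reverse.dropWhile(_ == '0').dropWhile(_ == '.').reverse
                  else expanded
              normalized.hashCode  //String hashing mixes in every remaining digit
          }

      For example, proposedHash(BigDecimal("1.10")) and proposedHash(BigDecimal("1.1")) agree, since both normalize to the string "1.1".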
      

        Activity

        Paul Phillips added a comment -

        I don't see a problem with this implementation, though we should find a way out of the number nurturing business.

        Mark Harrah added a comment -

        Looks to me like hashCode is broken:

        import math.{BigDecimal,BigInt}
        val i: BigInt = 1000000
        val d: BigDecimal = 1000000
        val i4 = i pow 4
        val d4 = d pow 4
        i4 == d4 // true
        i4.hashCode == d4.hashCode // false
        
        Erik Osheim added a comment -

        Mark: I'm not sure what guarantees Scala expects to make. Reading the library, it seems like "unified hash codes" only apply to integer values between Int.MinValue and Int.MaxValue, so in this case I don't think the hash codes are supposed to be equal. That said, obviously I could imagine different strategies.

        Paul: Agreed; in many ways I don't think BigInt/BigDecimal belong in the stdlib, but rather in a library that can support them more fully (and/or opt for a better implementation).

        Anyway, I think this looks good and I'll try to help get a patch ready.

        Paul Phillips added a comment -

        Regardless of the range of unified hash codes, within the constraints imposed by reality, if two things are equal they must have the same hashCode. Here is a no-pretense-of-performance attempt, written a few hours ago, to address the distribution and correctness issues.

        https://github.com/paulp/scala/tree/issue/6153

        I would be thrilled if you took it from here.

        Simon Ochsenreither added a comment -

        I'll need to keep this in mind for the Scala-native BigInt implementation I'm working on.
        Are there any guarantees concerning the stability of hashCode values between different Scala versions?

        Bennet Huber added a comment -

        I've begun working on a patch for this that should solve all of these issues (collisions, consistency between types, and equivalent values of BigDecimal generating the same hashcode). It's mostly done, just needs some fixup and polish. Unfortunately, I won't have time to finish it today, but I hope to submit a pull request by the end of the week.

        Rex Kerr added a comment -

        Any progress on this, or shall I fix it?

        Rex Kerr added a comment -

        There is no sane way to achieve all of the following: (1) a low collision rate; (2) BigInt and BigDecimal hash codes that agree with equality (w.r.t. each other); (3) a BigDecimal hashCode that can't explode the heap. The problem is that BigInt isn't decimal, so computing the hash code of 1e1000000000 would require churning through the base-2 representation of a billion-digit decimal number. Not fun. Given enough time, I could perhaps invent a hashing function with a simplifying form for long runs of zeros, so that powers of ten of BigInt would have easy-to-compute hash codes. But this is way too time-consuming to get right (if it is even possible). Instead, I've applied a simple heuristic: values with a small number of digits convert the BigDecimal to a BigInt; values with a large number of digits use a different calculation on the BigDecimal alone. This is part of https://github.com/scala/scala/pull/3316
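
        Purely as an illustration of the shape of that heuristic (the digit threshold, helper name, and wide-value hash below are invented for this sketch and are not taken from the linked pull request):

        import scala.math.BigDecimal

        //Illustration only; the real cutoff and calculations in the patch differ
        val smallDigitLimit = 4096  //invented threshold for "a small number of digits"

        def heuristicHash(x: BigDecimal): Int = {
          val wholeDigits = x.precision.toLong - x.scale  //approximate count of digits left of the point
          if (x.isWhole && wholeDigits < smallDigitLimit)
            x.toBigInt.hashCode  //modest whole values: agree with the equal BigInt's hash
          else
            x.bigDecimal.stripTrailingZeros.hashCode  //wide or fractional values: hash a normalized form of the BigDecimal itself
        }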


          People

          • Assignee: Rex Kerr
          • Reporter: Bennet Huber
          • Votes: 0
          • Watchers: 8
