Details

Type: Bug

Status: Closed

Priority: Major

Resolution: Fixed

Affects Version/s: Scala 2.10.0

Fix Version/s: Scala 2.11.0M8

Component/s: Misc Library

Labels:
Description
Scala's BigDecimal.hashCode() implementation, while not technically broken, is implemented in such a way as to nearly guarantee a large number of collisions. The current implementation looks like this:
override def hashCode(): Int = if (isWhole) unifiedPrimitiveHashcode else doubleValue.##
unifiedPrimitiveHashcode essentially casts this to an Int and returns the result.
There are several problems with this implementation. For the time being, let's ignore whole numbers caught by if(isWhole) and assume we're working with numbers having fractional parts. The fact that the number's Double representation is used to calculate the hash value in this case yields the following undesirable properties:
 All BigDecimal values greater than Double.MaxValue have equivalent hash values, since they all map to Double.PositiveInfinity
 All BigDecimal values less than Double.MinValue have equivalent hash values for the same reason
 All BigDecimal values with absolute value smaller than Double.MinPositiveValue have equivalent hash values, since they all map to 0.0
 All BigDecimal values with more than 17 significant digits have nearby neighboring values with the same hash value, since all the digits after the first 17 are truncated by the cast to Double
Here is some simple test code that demonstrates the problem:
val doubleEpsilon: BigDecimal = BigDecimal(Double.MinPositiveValue) //4.9E324 val subLong: BigDecimal = BigDecimal(Long.MaxValue / 2) val smallNumbers: Set[Int] = (1 to 100) map (i => (doubleEpsilon * i / 1000).##) toSet val largeNumbers: Set[Int] = (1 to 100) map (i => (subLong + i + 0.1).##) toSet //smallNumbers.size == 1 //largeNumbers.size == 1
BigDecimal values that are whole numbers fare slightly better. Since they are cast to Ints, their lower 32 bits are used as the hash value, solving the nearby neighbor collision problem. Unfortunately, since only the first 32 bits are used, it is trivial to generate collisions between them simply by flipping bits beyond the first 32. This will lead to poor performance for any algorithms using them as masks, multiplying or dividing by large powers of two, working with Mersenne primes, etc.
The following improved hashCode() implementation, presented as pseudocode, should solve both of these problems and is relatively simple. Since hashCode would be more expensive to calculate, it might make sense to store the result as well. I haven't tested this implementation.
override def hashCode(): Int = { if (isWhole && this < Int.MaxValue && this > Int.MinValue) unifiedPrimitiveHashcode //This makes the hashCode implementation consistent with Int representations else { //convert to string (fully expanded, no scientific notation) //if string contains ".": // strip trailing 0s // strip trailing . // //return string's hashcode } }
I don't see a problem with this implementation, though we should find a way out of the number nurturing business.