Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider replacing existing mutable HashMap implementation with OpenHashMap, which seems faster #5263

Closed
scabug opened this issue Dec 2, 2011 · 22 comments

Comments

@scabug
Copy link

scabug commented Dec 2, 2011

collection.mutable.OpenHashMap seems to beat collection.mutable.HashMap in performance. I posted a microbenchmark for just insertion here: https://gist.github.com/1423303. I've also found in informal testing that OpenHashMap is faster.

OpenHashMap could replace the existing implementation, after doing more extensive profiling to see if/when the existing implementation is faster.

@scabug
Copy link
Author

scabug commented Dec 2, 2011

Imported From: https://issues.scala-lang.org/browse/SI-5263?orig=1
Reporter: @pchiusano
Affected Versions: 2.9.1

@scabug
Copy link
Author

scabug commented Dec 9, 2011

@paulp said:
I ran my usual coarse benchmark for such things, compiling the scala distribution.

As-is: 7 minutes 5 seconds
Replacing HashMap with OpenHashMap: 7 minutes 59 seconds

So this is wontfix in the absence of a significantly more convincing argument.

@scabug
Copy link
Author

scabug commented Dec 9, 2011

@scabug
Copy link
Author

scabug commented Dec 10, 2011

@pchiusano said:
My whole point in opening this ticket was that more rigorous performance tests be done to determine whether changing the default implementation would make sense. If those tests have already been done, great, just point me to where the results are summarized and I'll shut up. :) If they haven't been done, why was this closed as wontfix?

It looks like Ismael has already done some testing - Ismael - can you explain what the numbers in the table mean in the link you gave? You did not actually explain that anywhere in your message. :) Either OpenHashMap was significantly worse for this benchmark, or significantly better... Besides that randomized search test what other tests did you do and can you summarize your general results?

@scabug
Copy link
Author

scabug commented Dec 11, 2011

@ijuma said:
I thought that it was clear when I said "severe problems at small sizes". Also, I linked to the benchmarks and explained that they were based on the ones used by Doug Lea and others from jsr166. I also linked to an older post explaining issues with OpenHashMap including the fact that it's not efficient when it comes to memory usage.

@scabug
Copy link
Author

scabug commented Dec 11, 2011

@pchiusano said:
Ismael - ok, maybe am being dense. :) But I am still kind of confused, since in the earlier post (http://scala-programming-language.1934581.n4.nabble.com/scala-Open-versus-Chained-HashMap-td1997654.html) you say that OpenHashMap "seems faster than" the existing impl, then in the more recent one it looks like OpenHashMap is significantly slower... at least for that one benchmark's results. Was that true across the board for all the tests you did? What accounts for this discrepancy? I saw the link to the benchmarking code you used, but I would love to see some sort of summary/writeup of your tests - when was one vs the other slower, what do you think explains the differences, etc.

@scabug
Copy link
Author

scabug commented Dec 11, 2011

@ijuma said:
Replying from my phone, so sorry for being terse. Yes, there are cases where OpenHashMap does well, but it does so bad in others that it's not suitable as the default implementation as it stands in my opinion. I intend to look at this again soon and will report back.

@scabug
Copy link
Author

scabug commented Dec 11, 2011

@paulp said:
What is your impression of the arrangement here? Do you often find people lining up to perform tests which only you have expressed an interest in and which you won't perform yourself? I can guarantee you the default implementation will not be changed if it imposes a 10+% slowdown on the compiler (regardless of whether it could be "worked around") and I'm not too sure why I should have to convince you of this.

@scabug
Copy link
Author

scabug commented Dec 11, 2011

@pchiusano said:
Ismael - okay, thanks for the info. Am interested to hear more when you do go back to take a look!

Paul - I don't think I am the only person interested in the performance of fundamental collections like hash maps. IMO, the recent reactions to the Yammer leaked email show that lots of people are interested in it - people want Scala's collections to be competitive with Java's for the use cases they care about. And since this is such a fundamental performance issue for so many people, I would have thought that Typesafe would be interested in profiling and optimizing these sorts of things even without someone from the community spearheading. But that is your call obviously. :) I also understand if you guys are really busy with other stuff - though if that is the case, maybe this ticket could be turned into a general "get Scala's mutable collections to be more competitive with Java's" ticket and put on the backlog or something.

I can guarantee you the default implementation will not be changed if it imposes a 10+% slowdown on the compiler (regardless of whether it could be "worked around") and I'm not too sure why I should have to convince you of this.

Here's my argument, and feel free to disagree :) - there is a lot of Scala code out there, including the compiler itself. As a straw man, if 99% of the Scala code out there would see a dramatic performance improvement as a result of some change, but the compiler itself would see a 10% decrease in performance, then for the good of the overall Scala community such a change might be warranted. Basically, the Scala compiler is not necessarily representative of the "average" codebase of the rest of the community of Scala users. Also, what if the compiler's slowdown is explained by something dumb like - the compiler uses lots of small maps, and the default initial size for AlternateMapImplementation is too high? If something like this were the explanation, it could have an easy fix, and then both the scala compiler and the rest of the community could see a speed improvement.

Paul, btw, I'm actually really curious what performance difference you see in the compiler if you replace the map implementation with one that delegates to java.util.Map.

Anyway feel free to close the ticket - I just, in general, don't like it when tickets get closed with zero discussion or response from the original submitter. Even if you think it's a bad bug report or whatever, I just find it sort of off-putting.

@scabug
Copy link
Author

scabug commented Dec 11, 2011

@paulp said:
I invested enough time to switch OpenHashMap in for HashMap (which is not completely trivial) and collected timings. You consumed an appreciable chunk of that day's total available effort on something you could have done yourself, so if you're trying to help keep that in mind. It's still pretty close to just me trying to cover the entire distribution and handle all incoming tickets.

I think about little but performance issues these days, but a ticket which says "make mutable collections faster" is not useful, it only contributes to one of my biggest problems, that being the amount of noise in the bug database. When I have hundreds of tickets to select from, how helpful do you think it is to have additional ones opened up with vague requirements regarding subjects which we hear about every day whether we want to or not? So if it seems a bit off-putting then that's probably intentional. This should not be read to discourage the opening of specific, fully articulated issues. It should be read to mean that a significant fraction of new tickets contribute a net negative by draining my very limited time.

@scabug
Copy link
Author

scabug commented Dec 11, 2011

@pchiusano said:
Paul - okay, fair enough. I appreciate the time you put into it.

Just as a idea, to deal with the "noise in the bug database", maybe there could be a designation of "additional info requested" or something like that for issues? Then when you receive an issue that you don't think is really actionable as reported, you designate it "additional info requested", add a comment about what other info is needed, and even assign it back to the submitter. So the issue stays alive but you don't have to look at it all the time, and there's a clear next step for the submitter to push things along. This might be better than just closing the ticket.

Anyway, just a thought. Feel free to close this ticket!

@scabug
Copy link
Author

scabug commented Dec 11, 2011

@paulp said:
There is such a status - "NEEDINFO" - but presumably I consider this more of a closed issue than you do. I did say in my closing comment "in the absence of a better argument", which is what "needinfo" would mean.

Because JIRA is a steaming pile of garbage, I can't assign anything to the submitter without firing up the admin interface and adding them to the list of "scala developers", which isn't going to happen. (The alternative, that tickets can be assigned to everyone, means that I get a dropdown list of 4000 people every time I want to assign a ticket. There is apparently no middle ground.)

@scabug
Copy link
Author

scabug commented Dec 27, 2011

@ijuma said:
Paul, I have a working version of the HashMap I want to propose as a replacement for mutable.HashMap and I would like to perform the same test you did above when you said:

"As-is: 7 minutes 5 seconds
Replacing HashMap with OpenHashMap: 7 minutes 59 seconds"

Is this as simple as replacing mutable.HashMap with my implementation and running ant or is it a more involved process? If the latter, would you please outline it so that I can make sure we improve there too (or at least don't regress)? Thanks!

@scabug
Copy link
Author

scabug commented Dec 27, 2011

@paulp said:
It is as simple as that, plus (in this case of OpenHashMap) a variety of tweaks elsewhere based on where it doesn't compile anymore due to using HashMap-specific things. Then it's "ant all.clean locker.done" to get the code in position and "time ant build" to get a time.

@scabug
Copy link
Author

scabug commented Dec 27, 2011

@ijuma said:
Perfect, thanks. I timed the whole "ant build" and it was 8 minutes 43 seconds for the existing mutable.HashMap and 8 minutes 41 seconds for the new one. Very close. In other tests, the new one does quite a bit better (around 25-30ns instead of 40ns average for a combination of important operations at various map sizes) and the new one uses significantly less memory.

I did it a bit differently than what you suggest because I didn't know about the locker.done command, but, if I understand correctly, this should not be an issue because creating the locker which should be the same for both invocations (and it was). I did run ant all.clean before each invocation.

What's the best way to proceed. File a separate issue or add details to this one?

@scabug
Copy link
Author

scabug commented Dec 27, 2011

@paulp said:
This one's still open so it'll do fine.

@scabug
Copy link
Author

scabug commented Sep 10, 2013

@Ichoran said:
We can do substantially better than OpenHashMap or HashMap. Given the lack of activity here, I'll supply my own unless Ismael still has his code around.

@scabug
Copy link
Author

scabug commented Sep 11, 2013

@ijuma said:
Please go ahead Rex. I've been far too busy with other things to follow up on this.

@scabug
Copy link
Author

scabug commented Sep 13, 2013

@Ichoran said:
I've got something within 30% of Java (i.e. takes 30% more time than Java--compare to 3x more time with current implementation) based on Erik Osheim's IntSet implementation. Oddly, the JVM deoptimizes it after a while (that's the 30% value); it's initially only 15% slower. It could be faster (faster than Java even), but it does various things to make sure that the worst case performance is a lot better than Java. Still need to try out a few variations to see what's optimal for typical cases.

@scabug
Copy link
Author

scabug commented Nov 14, 2013

@JamesIry said:
scala/scala#3119

@scabug
Copy link
Author

scabug commented Nov 15, 2013

@Ichoran said:
OpenHashMap is not robustly faster than HashMap. In particular, when repeatedly adding and removing keys it can have disastrously poor performance. The pull request referenced by James above contains a LongMap and an AnyRefMap that are substantially faster than either OpenHashMap or HashMap in almost all circumstances where either Long or AnyRef keys are used.

@scabug
Copy link
Author

scabug commented Dec 3, 2013

@Ichoran said:
HashMap cannot be replaced as there is code that depends on the details of its internals and interface. However, a new LongMap and AnyRefMap are provided that are considerably faster for those cases where they apply. In cases where full Any handling (with equality of boxed primitives across types) is required, use HashMap. Otherwise, LongMap and AnyRefMap should be faster drop-in replacements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants