16747 – Poor performance of degenerated unordered_multimap

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 16747 - Poor performance of degenerated unordered_multimap

Summary: Poor performance of degenerated unordered_multimap

Status:	RESOLVED WONTFIX

Alias:	None

Product:	libc++
Classification:	Unclassified
Component:	All Bugs (show other bugs)
Version:	unspecified
Hardware:	All All

Importance:	P enhancement
Assignee:	Howard Hinnant

URL:
Keywords:

Duplicates (1):	21275 (view as bug list)
Depends on:
Blocks:

Reported:	2013-07-30 08:19 PDT by Gerald Slacik
Modified:	2015-08-25 22:01 PDT (History)
CC List:	2 users (show)

See Also:	21275
Fixed By Commit(s):

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Gerald Slacik 2013-07-30 08:19:17 PDT

The following code uses an unordered_multimap to associate very few values with a long list of values. The example performs very poor compared to libstdc++ and the std lib implementation of Visual Studio 2012.


#include <chrono>
#include <iostream>
#include <string>
#include <unordered_map>

int main (int argc, char* argv[]) {
  const int N = (argc > 1) ? std::stoi(argv[1]) : 123456;
  const auto begin = std::chrono::high_resolution_clock::now();
  std::unordered_multimap<int, int> map;
  for (int i = 0; i < N; ++i) {
    map.insert({i % 4, i});
  }
  const auto end = std::chrono::high_resolution_clock::now();
  const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
  std::cout << "Duration for " << map.size() << " integer inserts: " << ms.count() << "ms" << std::endl;
}



Results on Linux (SLES 11.2):
=============================
clang-3.2 with libc++ from 2013-01-31: 
Duration for 123456 integer inserts: 21605ms

clang-3.3 with libc++ from 2013-01-31: 
Duration for 123456 integer inserts: 21836ms

gcc-4.6 with GLIBCXX_3.4.16:
Duration for 123456 integer inserts: 7ms


Results on Windows 2008 Server:
===============================
Visual Studio 2012 SP3:
Duration for 123456 integer inserts, count=123456: 15ms

Results on OS X 10.8.4:
=======================
Apple LLVM version 4.2 (clang-425.0.28) (based on LLVM 3.2svn):
Duration for 123456 integer inserts: 18702ms


The Linux and Windows results are on the same hardware, the OS X test was run on a Mac Pro 2.93GHz and has similar performance to the Linux/Windows hardware.

I am building an inverse map for fast look-up. In most cases this works fine. In rare cases the map degenerates to a few values being mapped to a long list. In these rare case the performance deteriorates badly compared to other implementations.

Comment 1 Howard Hinnant 2013-07-30 11:18:53 PDT

This behavior is due to an extension for unordered containers that was meant to make the transition between multimap and unordered_multimap more seamless (in both directions).

For multimap, an insert without hint is required to insert at the end of any existing equal range within the multimap.  This is not required of unordered_multimap, but libc++ implements this rule anyway to make these two data structures more interchangeable.  And it is this behavior that is causing the dismal performance in this example.

In this example, the unordered_multimap degenerates into 4 very long linked lists:

  size_t nb = map.bucket_count();
  std::cout << nb << '\n';
  for (size_t b = 0; b < nb; ++b)
  {
       size_t bs = map.bucket_size(b);
       if (bs != 0)
          std::cout << b << " : " << bs << '\n';
  }

205759
0 : 30864
1 : 30864
2 : 30864
3 : 30864

And so each insertion is traversing one of these four lists to insert at the end.  A hash-based container is ill-suited for this type of data, and therefore the libc++ unordered containers have not been optimized in any way for this case.  A much better data structure would be an array of forward_list, or an array of vectors.

However I understand that this situation can arise accidentally.  There is a way to tell libc++ to insert at the front of these equal ranges instead of at the back:

  for (int i = 0; i < N; ++i) {
    map.insert(map.find(i % 4), {i % 4, i});
  }

I.e. first search for the key.  This will find the beginning of the equal_range, if it exists.  And then the hint is used to insert just before that.  When this small change is made, the time for me drops from 18543ms to 15ms.

Such a strategy could even be switched to at run time by monitoring the max collision (max bucket_size(b)), and switching to it when it exceeds some predetermined threshold.

It is an open question as to whether this extension (for unordered_multimap without hint to insert in the same way that multimap does) is a value to libc++ clients or not.  I know from decades of experience that clients definitely do value the prescribed order in multimap.  Indeed, this is a new feature in C++11, standardized in large part by customer demand.  And so my guess is that people will value the same behavior in unordered_multimap, even though the order between different equal_range's remains unspecified.

I am curious.  Now that you know the cause, motivation, and workaround for this behavior, what are your thoughts?

Comment 2 Howard Hinnant 2013-07-30 11:29:30 PDT

PS:  The libc++ implementation is all about giving you knobs that you can tweak for higher performance and/or better functionality.  By tweaking a few more knobs I can double the speed again:

  std::unordered_multimap<int, int> map(4);
  map.max_load_factor(INFINITY);
  for (int i = 0; i < N; ++i) {
    map.insert(map.find(i % 4), {i % 4, i});
  }

7ms

Comment 3 Gerald Slacik 2013-07-31 10:26:36 PDT

Thanks for your quick answer. The workaround works perfectly fine for me. Although I was not sure if I should set the status to resolved.

I didn't think of it myself because I didn't think it would change anything.

I can understand your motivation for keeping the order having been there myself. If I understood the MS implementation correctly they keep an end pointer to the list and just append; so they keep the order too. (I didn't look at the libstdc++ implementation).

The change has no measurable impact on the other libs, so I can use the same code for all platforms.

The second speedup tip with max_load_factor is not working for me. In my application I have no idea about the final size. And when I tried it the first time, I forgot the number of buckets in the constructor and the performance was as bad as before.

In my application the degenerated case is very rare. It did go unnoticed for some time. In the rare cases where it occurs the response time went from under a second to tens of minutes. And I missed it in my tests.

We are shipping on SLES and the default gcc is very old (4.3). So I switched to clang/libc++ to use C++11 features. After we found this problem we started shipping with gcc 4.6/libstdc++, which causes all sorts of other problems. So I am very happy to switch back to releasing with clang/libc++.

By the way I like the policy for allowing power-of-2 number of buckets. MS always uses power-of-2 number of buckets and I have a case where this can really hurt me. Making it tuneable is a very good idea.

Comment 4 Howard Hinnant 2013-07-31 11:04:22 PDT

Thanks for your feedback Gerald.

Comment 5 Eric Fiselier 2015-08-25 22:01:49 PDT

*** Bug 21275 has been marked as a duplicate of this bug. ***