Tuesday, September 15, 2015

Should you be using something instead of what you should use instead?

The April 2000 C++ Report included an important article by Matt Austern: "Why You Shouldn't Use set—and What to Use Instead." It explained why lookup-heavy applications typically get better performance from applying binary search to a sorted std::vector than they'd get from the binary search tree implementation of a std::set. Austern reported that in a test he performed on a Pentium III using containers of a million doubles, a million lookups in a std::set took nearly twice as long as in a sorted std::vector.
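
In code, the two approaches look roughly like this (a minimal sketch, not Austern's code; the std::vector must already be sorted):

    #include <algorithm>
    #include <set>
    #include <vector>

    bool inSet(const std::set<double>& s, double x)
    {
        return s.find(x) != s.end();   // O(lg n), chasing pointers through tree nodes
    }

    bool inSortedVector(const std::vector<double>& v, double x)
    {
        // O(lg n) over contiguous storage; lower_bound returns the first
        // element not less than x, so we still have to check for equality.
        auto it = std::lower_bound(v.begin(), v.end(), x);
        return it != v.end() && !(x < *it);
    }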

Austern also reported that the std::set used nearly three times as much memory as the std::vector. On Pentium (a 32-bit architecture), doubles are 8 bytes in size, so I'd expect a std::vector storing a million of them to require about 7.6MB. Storing the same million doubles in a std::set would call for allocation of a million nodes in the underlying binary tree, and assuming three 4-byte pointers per node (pointer to left child, pointer to right child, pointer to parent), each node would take 20 bytes. A million of them would thus require about 19MB. That's only 2.5 times the memory required for the std::vector, but in 2000, I believe it was fairly common for each dynamic memory allocation to tack on a word indicating the size of the allocated storage, and if that was the case in Austern's test, each tree node would require 24 bytes—precisely three times the memory needed to store a single double in a std::vector.

The difference in memory utilization explains why searching a sorted std::vector can be faster than searching a std::set holding the same data. In a std::vector, the per-element data structure overhead present in a search tree is absent, so more container elements fit onto a memory page or into a cache line. We'd thus expect fewer page faults and/or cache misses when looking things up in a std::vector, and faster memory access means faster lookups.

Many people were influenced by Austern's article and by independent experiments bolstering his conclusions. I was among them. Item 23 of Effective STL is "Consider replacing associative containers with sorted vectors." Boost was convinced, too: Boost.Container offers the flat_(multi)map/set associative containers, citing as inspiration both Austern's article and the discussion of the sorted std::vector-based AssocVector in Andrei Alexandrescu's Modern C++ Design. (In his book, Alexandrescu references the C++ Report article. All roads in this area lead to Matt Austern.)

There's nothing in the article about containers based on hash tables (i.e., the standard unordered containers), presumably because there were no hash-table-based containers in the C++ standard library in 2000. Nevertheless, the same basic reasoning would seem to apply. Hash tables are based on nodes, and node-based containers incur overhead for pointers and dynamic allocations that std::vectors avoid.

On the other hand, hash tables offer expected O(1) lookup complexity, while sorted std::vectors offer only O(lg n). This suggests that there should be a point at which the more conservative memory usage of a std::vector is outweighed by the superior computational complexity of an unordered container.

After a recent presentation I gave discussing these issues, Paul Beerkens showed me data he'd collected showing that on simple tests, the unordered containers (i.e., hash tables) essentially always outperformed sorted std::vectors for container sizes beyond about 50. I was surprised that the crossover point was so low, so I did some testing of my own. My results were consistent with his. Here's the data I got for containers between size 20 and 1000; the X-axis is container size, the Y-axis is average lookup time:
The lines snaking across the bottom of the graph are for the unordered containers. The other lines are for the tree-based containers (std::set and std::map) and for Boost's flat containers (i.e., sorted std::vectors). For containers in this size range, the superiority of the hash tables is clear, but the advantage of the flat containers over their tree-based counterparts isn't. (They actually tend to be a bit slower.) If we bump the maximum container size up to 20,000, that changes:
Here it's easy to see that for containers with about 5000 elements or more, lookups in the flat containers are faster than those in the binary trees (though still slower than those in the hash tables), and that remains true for containers up to 10 million elements (the largest I tested):
For very small containers (no more than 100 elements), Beerkens added a different strategy to the mix: linear search through an unsorted std::vector. He found that this O(n) approach performed better than everything else for container sizes up to about 35 elements, a result that's consistent with the conventional wisdom that, for small data sets, linear search runs faster than more complicated hash-based and binary-search-based algorithms.
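
In code, that third strategy is nothing more than std::find over unsorted data (a sketch; Beerkens' actual test code may differ):

    #include <algorithm>
    #include <vector>

    // O(n), but for tiny containers the whole data set may sit in one or
    // two cache lines, and there's no hash to compute and no tree to walk.
    bool inUnsortedVector(const std::vector<double>& v, double x)
    {
        return std::find(v.begin(), v.end(), x) != v.end();
    }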

The graphs above correspond to tests I ran on an Intel Core i7-820 using GCC 5.1.0 and MinGW under 64-bit Windows 7. Optimization was set to -O3. I ran the same tests using Visual C++ 2015's compiler, but for reasons I have yet to investigate, all the timings for the hash tables were zero under that compiler. I've therefore omitted data for VC++. Interestingly, the code to perform linear searches through unsorted std::vectors took zero time under GCC, though non-zero time under VC++. This is why I show no data comparing lookup speeds for all of binary search trees, hash tables, sorted std::vectors, and unsorted std::vectors: neither GCC nor VC++ generated non-zero lookup times for all approaches.

Maybe GCC optimized out the loop doing the lookups in the unsorted std::vectors, while VC++ optimized away the loops doing the lookups in the hash tables. Or maybe my test code is flawed. I don't know. Perhaps you can figure it out: here's the test code. (It's based on code Paul Beerkens shared with me, but I made substantial changes, so if there's something wrong with the test code, it's my fault.) Feel free to play around with it. Let me know what you find out, either about the code or about the results of running it.

If Paul Beerkens' and my results are valid and generalize across hardware, compilers, standard library implementations, and types of data being looked up, I'd conclude that the unordered standard associative containers (i.e., hash tables) should typically be the ones to reach for when you need high-performance lookups in an associative container and element ordering is not important. I'd also conclude that for very small containers, unsorted std::vectors are likely to be the way to go.

As for the title of this blog post, it looks like what you should use instead of using std::set (and std::map and their multi cousins) these days is probably a truly "unordered" container: a hash table or, for small containers, an unsorted std::vector.

Scott

UPDATE ON 17 SEPTEMBER

Jonathan Wakely posted source code showing how to replace my Windows-specific timing code with portable, standards-conformant C++, and Tomasz Kamiński took my code incorporating that change and sent back a revised program that (1) uses the modify-a-volatile trick to prevent compilers from optimizing the lookup loops away and (2) checks the result of each lookup against the container's end iterator (because, in practice, you'd always need to do that). When I run Kamiński's code, I get non-zero lookup times for all containers with both GCC and VC++.
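
The essence of the volatile trick is to funnel each lookup's result into a volatile object, whose writes the compiler must assume are observable (a sketch of the idea, not Kamiński's actual code):

    #include <chrono>

    volatile bool sink;   // writes to a volatile can't be optimized away

    template<typename Container, typename Values>
    std::chrono::steady_clock::duration timeLookups(const Container& c,
                                                    const Values& vals)
    {
        auto start = std::chrono::steady_clock::now();
        for (const auto& v : vals) {
            sink = (c.find(v) != c.end());   // check against end(), publish result
        }
        return std::chrono::steady_clock::now() - start;
    }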

Here are the results I got for containers of up to 100 elements with GCC:
And here they are for VC++:
With both compilers, linear search through an unsorted std::vector is fastest for very small containers, but it doesn't take more than about 20-50 elements for the hash tables to take the lead. Whether that remains the case across hardware, compilers, standard library implementations, and types of data being looked up, I don't know. (That caveat is present in the original post above, but some commenters below seem to have overlooked it.)

72 comments:

  1. I dunno, Scott. This article seems a bit flat to me.

    I mean, anybody who actually cared about the performance in such matters should have profiled anyway and should have made their own comparison. Anybody who used a particular solution without actually checking to see if the other choices were faster should not be surprised to find that it's not faster.

    Bottom line, I don't really see what the point of this article is. There's no new information here. Just "Profile if you want to go fast", and that's optimization 101 and nothing to do with C++ or container lookups.

  2. @DeadMG, honestly, what kind of response is that? If the article did not benefit you much, just go about your business. Personally, I found this article quite useful. Even if you are busy profiling your own code, it is still useful to have an outside party produce results to compare yours against. What if Scott had produced wildly different results from yours? That would have been info of great interest.

  3. If you go for speed, std::unordered_* have the issue that they are not very cache-friendly. http://incise.org/hash-table-benchmarks.html has some evidence for that. There are other hash table implementations that utilize a more efficient memory layout (std::unordered_* can't do that because of requirements in the standard).

  4. @DeadMG: The point of the article is not "profile if you want to go fast." (The post doesn't even mention profiling.) The point is that, as a general rule of thumb, turning to hash tables for lookup-heavy applications in containers where ordering doesn't matter may represent good design. If your initial choice of data structure yields acceptable performance, there's generally no need to profile and optimize.

    From what I hear, many C++ developers reflexively use the C++98 associative containers (e.g., std::set, std::map) without considering any alternatives. Perhaps this post will encourage them to broaden their horizons.

  5. @Stefan:
    I would argue that hash maps in general are _more_ cache friendly for large data sets than any kind of binary search.

    For a binary search on a random data point in a large data set, you will get O(lg(n)) cache misses. Every node or chunk of vector memory will require access to main memory. The op< calls may also require dragging additional data in from main memory if the compared objects span multiple cache lines.

    For a hash table, you calculate the hash on the search key. This should be a cache hit. Then you get a cache miss when looking for the bucket. Then you have ~load_factor() misses for each item in the bucket if you don't find the item, and ~load_factor()/2 misses if you do find the item. You also may get cache misses for op== if that is expensive.

    So binary search gets lg(n) cache misses, hash table gets 1+load_factor() cache misses.
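
    To put illustrative numbers on that: with n = 1,000,000, a cold binary search costs on the order of lg(1,000,000) ≈ 20 misses, while a hash table held at a max load factor of 1.0 costs roughly 1 + 1 = 2.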

    Note that if the searches aren't random, then a binary search may be better. lg(n) cache hits could very well be faster than O(1) cache misses.

    All that being said, it is entirely possible that std::unordered_map isn't the best way to do a hash map, but I think it will still beat out an ordered map most of the time.

  6. @Stefan: Thanks for the link. Am I overlooking something, or did your testing not include lookups? I see graphs only for insertions and deletions.

    Replies
    1. It's not my testing, I linked someone else's work - I can't publish my data, but for lookups the picture is similar. The problem is that the standard essentially requires you to implement the buckets as linked lists, so if you have a bucket with many elements you incur lots of cache misses while searching for "your" element. An alternative is to use an open addressing scheme where you scan a (depending on the scheme) contiguous area in memory.

  7. @Stefan: One could argue, I think, that if you have a bucket with many elements, your table is too small or your hash function needs work :-) Nevertheless, if the data show that other hash table designs yield better performance, possibly at the expense of not adhering to the C++ standard, that's useful information.

  8. @Ben, I wasn't arguing sorted array vs unordered_*, I was merely pointing out that I'd say that if you truly care about speed (and have small keys like integers), unordered_* might not be what you want.

    And as a side note, for the binary search case an array that is truly sorted is not optimal as - as you pointed out - you jump around a lot. However, the access pattern in binary search is very predictable, so you can reorder your elements so that elements tested in succession end up in nearby locations (e.g. start with the data point at the 50% quantile, then the one at 25%, then the one at 75%, ...). Better yet - instead of a binary search do a k-ary search with k chosen to suit your cache line size. http://cglab.ca/~morin/misc/arraylayout/ has some data for that (also not my site ;))
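
    The reordering idea in code (a minimal sketch of the BFS/Eytzinger layout, assuming distinct elements and eyt.size() == sorted.size()):

      #include <cstddef>
      #include <vector>

      // In-order walk of the implicit tree rooted at index 0 (children of k
      // at 2k+1 and 2k+2) lays the sorted input out in BFS order.
      void buildEytzinger(const std::vector<int>& sorted, std::vector<int>& eyt,
                          std::size_t& i, std::size_t k = 0)
      {
          if (k < eyt.size()) {
              buildEytzinger(sorted, eyt, i, 2*k + 1);   // left subtree
              eyt[k] = sorted[i++];
              buildEytzinger(sorted, eyt, i, 2*k + 2);   // right subtree
          }
      }

      // Probes descend the implicit tree, so the first few levels sit
      // together at the front of the array and stay hot in cache.
      bool containsEytzinger(const std::vector<int>& eyt, int key)
      {
          std::size_t k = 0;
          while (k < eyt.size()) {
              if (eyt[k] == key) return true;
              k = 2*k + 1 + (eyt[k] < key);   // next index from the comparison
          }
          return false;
      }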

  9. @Scott: If you have a max load factor of 2, and a table of ~180000 elements, you probably have a few buckets with 7 elements.

  10. @Ben Craig: Fair enough, but unless you're hitting these buckets repeatedly, it's hard to imagine that time spent searching them would be a bottleneck. Though, now that I think about it, if you're hitting them repeatedly, that would explain why they're overly full...

  11. It would be interesting to include google::dense_hash_map as well.

  12. A few more comparisons: http://stackoverflow.com/questions/21166675/boostflat-map-and-its-performance-compared-to-map-and-unordered-map

  13. Wow, what a coincidence!
    This is exactly the same thing I was playing around with the last couple of days!

    Can confirm all your results in VC15 (strangely, I didn't receive zero values for my unordered_map times, though that might be a result of my structuring of the tests).

    Thanks for the write-up!

  14. @TP: Could you please download my source code and see if you get non-zero times for the std::unordered containers with VC15?

  15. Hi Scott, I gave it a try and I get non-zero times for unordered containers. I tried both x86 and x64 in Release mode, VS 2015 Enterprise.

  16. I received something like this for VS2015:
    https://gist.githubusercontent.com/Predelnik/a035755116ccb4c6d7a1/raw/3075651ffe63f10763d818927991cb62906412a5/MS_lookupTimings.txt
    Looks pretty similar to graphs for gcc.

  17. The code also aborts in Debug mode because your uniform distribution uses ints but your key type is unsigned long and they don't have the same range.

  18. It's also worth noting that, because of the way test data is generated, this mostly tests the failed lookup case.

  19. It's also very hard to say how meaningful these results are, because I've seen bogus numbers that I assume are related to the optimizer doing its thing. I'd like to include results for llvm::DenseMap. The current results I have overlap with the horizontal axis, which is too good to be true. I'll look into it a bit more tomorrow.

  20. @Ben Craig: Regarding binary search cache-friendliness: http://bannalia.blogspot.com.au/2015/06/cache-friendly-binary-search.html

  21. What about google sparse/dense map/set?

  22. @Matt Vogt, @Stefan:
    The level ordering and B-tree side of things is really useful, but with a large enough data set, it still has cache issues. Going with a level-ordered vector will take you from lg(n) misses to k hits and lg(n)-k misses. With a "medium sized" data set (whatever that is), you will be able to beat the 1+load_factor() misses that an unordered container has, but eventually O(1) beats out O(lg(n)). I do think it would be awesome if there were some standard algorithms / predicates that made it easier to create and use a level-ordered list though.



    @Scott Meyers:
    The number of elements in a bucket follows a Poisson distribution. With 200,000 elements and 2^17 buckets (load factor ~1.5), you will typically have ~3.5 buckets with 9 elements in them. You will also have ~21% of your buckets still empty. This is all with a perfectly uniform hash. It's a lot like the 'birthday paradox', but on a larger scale. If you want to tinker around with it in Excel, then do something like "=POISSON.DIST(<chain length>, <elements>/<buckets>, FALSE)". That will give you the percentage of buckets with the given chain length.

    Now, if you are continually searching for different elements, the big buckets will get averaged against the buckets with zero and one elements, and you end up with a number of lookups proportional to the load factor. If you pick one element and look it up a lot though, then you do have a chance to get unlucky. You can also get unlucky if you look for something that isn't present frequently, and the corresponding bucket is overly full.
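
    If Excel isn't handy, the same numbers fall out of a few lines of C++ (a sketch; lambda is elements divided by buckets):

      #include <cmath>
      #include <cstdio>

      // Poisson pmf: probability that a bucket holds exactly k elements,
      // assuming a uniform hash. p = e^-lambda * lambda^k / k!
      double poisson(double lambda, int k)
      {
          double p = std::exp(-lambda);
          for (int i = 1; i <= k; ++i) p *= lambda / i;
          return p;
      }

      int main()
      {
          const double lambda = 200000.0 / (1 << 17);   // ~1.53
          for (int k = 0; k <= 9; ++k)
              std::printf("chain length %d: %f of buckets\n", k, poisson(lambda, k));
      }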

  23. Sorry for the late reply,
    just ran your code through, getting non-zero times for the std::unordered containers on both x86 and x64 with full optimizations enabled.

  24. Here's a portable version of the timer class using std::chrono:

    https://gist.github.com/jwakely/2ca40c29c14dd6e30132

    As stated above, you have undefined behaviour in randomValue()

  25. @Nikola Smiljanić and Predelnik: Thanks for the information that my code produces non-zero values for you. I wonder if the problem for me is my optimization level (which is /Ox).

  26. @Nikola Smiljanić: Thanks for pointing out the bug in my use of std::uniform_int_distribution. I'll fix it and upload revised source code, hopefully later today.

  27. @Philip Deegan: My goal wasn't to test a wide range of data structures, it was primarily to revisit the data structures Matt Austern discussed in light of the std::unordered_* containers that were added to C++ as of C++11. Furthermore, my testing was elementary (as some other commenters have noted). Think of the blog post more as food for thought than as a comprehensive or rigorous technical investigation into how to truly optimize lookup performance.

    Having said that, I welcome comments (including experience reports) about other data structures and other lookup strategies. Just don't expect me to pursue them or to update the post as a result :-)

  28. For those interested, some time ago I measured the performance of several C++ closed-addressing hash containers on scenarios involving lookup, insertion and deletion:

    * MSVC (Dinkumware, Boost.Unordered, Boost.MultiIndex)
    * GCC (libstdc++-v3, Boost.Unordered, Boost.MultiIndex)
    * Clang (libstdc++-v3, libc++, Boost.Unordered, Boost.MultiIndex)

    For sheer lookup speed, open addressing is probably faster, but alas C++ requirements on unordered associative containers basically dictate closed addressing.

  29. I actually stumbled upon the cited paper just yesterday. My personal problem with the unordered containers is that it is so unnecessarily complicated to implement hashes for custom classes *and* they are missing for many standard types. I sincerely hope that proposals N3980 / N3983 will help in that regard. It's not only about performance, it's also about productivity and maintainability of the code.

    Regarding those benchmarks, the discussions about /for me on this system for this compiler/ look somewhat antiquated. Is there a web service where you can post code snippets and it gives you performance charts for a range of compilers / options / systems? A bit like continuous integration; benchmarking meets ideone.

  30. One interesting topic might be to see how open addressing hash tables perform. They're supposed to be a more cache-friendly version of the node-based hash tables that are in the standard already. There is a game development/low latency study group (I do not remember their exact name) that explores the possibility of adding such (open addressing) data structures to the STL.

  31. @Zulan: If there's a benchmarking web site of the kind you describe, I'm not familiar with it. I agree that it'd be great if one existed.

  32. @Jonathan Wakely: I've revised the test code to incorporate your platform-independent timer, as well as to fix the problem with std::uniform_int_distribution that Nikola Smiljanić reported above. The article now links to the new code, though the original code remains where it used to be.

    Your code leads me to be confused about the way std::chrono::high_resolution_clock::duration::period is specified. It's hardwired to be 1ns, but std::chrono::high_resolution_clock uses Windows' QueryPerformanceCounter, and unless there was a bug in my earlier code, one tick of that counter corresponds to 592ns on my machine. I'm having trouble understanding how std::chrono::high_resolution_clock can offer higher timing resolution than the hardware on my machine is capable of providing. I would have expected std::chrono::high_resolution_clock::period on my machine to be something like std::ratio<592,1000000000>. Any idea where I've gone wrong?
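
    For anybody who wants to poke at this on their own implementation, here's a quick sketch that prints both the advertised period and the smallest step the clock actually reports (the two need not agree; the period is just the unit the implementation converts its ticks into):

      #include <chrono>
      #include <iostream>

      int main()
      {
          using clock = std::chrono::high_resolution_clock;
          std::cout << "advertised tick: "
                    << clock::period::num << '/' << clock::period::den << " s\n";

          auto t0 = clock::now();   // spin until the clock advances to
          auto t1 = t0;             // observe its real granularity
          while (t1 == t0) t1 = clock::now();
          std::cout << "measured step: "
                    << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count()
                    << " ns\n";
      }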

  33. Hi Scott, should you be amending the title to read: "Should you be using something instead of what you currently use?", as the current title's meaning is hard to discern with its use of the word "instead" twice in the sentence.

    Also, you get the capitalisation a little off in the first link to Matt's article, putting a capital W on the word "what".

  34. Hi Scott,

    I think another parameter which should be included in the analysis / recommendation is the time spent computing hash values. I have seen cases with strings used as key types where std::map was much faster than a hash map, because most comparisons stop after comparing a couple of characters, but the hash function has to process the whole string. Maybe it would be interesting to re-run your benchmarks with std::strings (or fixed-size static strings, to prevent cache misses) of different lengths as keys?

  35. @Scott: Joaquín López Muñoz has a series of benchmarking articles that you might be interested in.

  36. I'm sorry, but when I see an "article" titled like yours, I fear cheap. And sure enough, as pointed out in other comments, it is yet again an elementary/basic, simplistic and *incorrect* explanation of a very important subject. But I am afraid that will always be the case when somebody (Scott Meyers) who doesn't have *any* *recent* industry experience tries to teach professionals. I'm sorry, I don't want to sound harsh, nor do I have anything personally against Scott Meyers, but for me, somebody without *recent* industry experience - that is, somebody who has not worked professionally as a programmer within the last 3-5 years - should/must not be taken seriously when talking about more advanced programming topics, which he knows only from, let's say, *academic* experience.

  37. I think the article itself is fine (though the title is incorrect; the last "instead" shouldn't be there). I'm just surprised that a prominent C++ programmer, in 2015, didn't know this.

    This is the reason why flat_set/map exist in Boost. They're just std::vector-based implementations of sets/maps. This has been known about for years, and has had several articles on it, as well as good Stack Overflow questions/answers. There have even been proposals for including something like flat_set/map in the standard.

    So I believe this is fairly well known already. So I think the best this article would do is to reach those dark corners of C++-dom that aren't aware of it, via Scott Meyers's popularity.

  38. @Paddy3118: The title of the blog post is a riff on the title of Austern's article. If I just wanted a straightforward title for it, it'd be something like "Trees, vectors, hash tables, and lookups."

    The capitalization I used for Austern's article title matches what was published in C++ Report (per the article image accompanying the blog post).

  39. @jensa: Before coming to any general conclusions, it would certainly be both interesting and worthwhile to test a variety of key types, including strings. That's why I wrote "If Paul Beerkens' and my results are valid and generalize across hardware, compilers, standard library implementations, and types of data being looked up..."

    I'll reiterate what I said to Philip Deegan: My goal wasn't to test a wide range of data structures, it was primarily to revisit the data structures Matt Austern discussed in light of the std::unordered_* containers that were added to C++ as of C++11. Furthermore, my testing was elementary (as some other commenters have noted). Think of the blog post more as food for thought than as a comprehensive or rigorous technical investigation into how to truly optimize lookup performance.

  40. @Nicol Bolas: One of the tentative conclusions the data point to is that there are relatively few use cases for sorted std::vector-based containers (i.e., flat containers). The data suggest that for lookup speed, they're generally inferior to unsorted std::vectors at small container sizes and to std::unordered containers at larger container sizes. For mixtures of insertions, erasures, and lookups, they're inferior to both tree-based and hash-table-based containers. That was the primary surprise to me: that flat containers may have very limited utility--much more limited than Austern's article suggests.

    Do you believe this is fairly well known?

  41. @Nicol Bolas,
    As I've said before, I have nothing against Mr Scott Meyers, but I don't think he should be referred to as a programmer. And prominent? C'mon. The most suitable term would be, let's say (I'm just really spinning the ball here), a tutor? I don't know, but to the best of my knowledge Mr Scott Meyers doesn't do any *industrial* programming, so I would not call him a programmer.
    Like I've said, there is nothing personal, but he is not a programmer.
    Would you call someone a programmer who never takes part in/programs any commercial projects, where all he does is go through the material in books? No, you wouldn't. You would call him a tutor.
    John Carmack comes to mind where prominent programmers are to be listed.
    But Mr Scott Meyers? Nope, tutor. Because that's what he does.

  42. @Scott Meyers
    "The data suggest that for lookup speed, they're generally inferior to unsorted std::vectors for small container sizes and to std::unsorted containers for larger container sizes. For mixtures of insertions, erasures, and lookups, they're inferior to both tree-based and hash-table-based containers. That was the primary surprise to me:"

    Really? Seriously, I don't want to sound arrogant, but I'm pretty certain that that kind of conclusion/suspicion should come even before any timing tests, based on rather basic knowledge of how insertions/deletions etc. work in a vector as opposed to the other containers mentioned here. But if it comes as a surprise? I really don't know what to think about it.

  43. @Knowing me, knowing you, a-ha: I'd appreciate it if, when quoting me about being surprised, you don't omit the text identifying the surprise. It wasn't that insertions and erasures in std::vectors are slower than in associative containers. It was "that flat containers may have very limited utility--much more limited than Austern's article suggests."

  44. @Scott Meyers
    Sorry about that, fair enough.
    But please could you answer:
    Would you consider yourself a *professional* programmer?

  45. I find I prefer sorted containers over unordered containers for two reasons:

    1. you are often able to write more efficient algorithms on sorted data sets.
    E.g. computing the union or intersection of two sets of elements

    2. I want output that is both predictable and deterministic.
    Just changing the size of an internal container or the insertion order of elements should not change a program's output.
    (obviously less of an issue when used internally)

    For larger datasets where memory is tight, the load factor is a performance killer.
    However, there has been some work on producing minimal perfect hashes in O(n log n) or slightly better, which would be very useful for applications that construct once and look up many times.

  46. @Knowing me, knowing you, a-ha: I've been on record for many years as not being a professional programmer. For example, in this 2006 article, I wrote:


    I’ll begin with what many of you will find an unredeemably damning confession: I have not written production software in over 20 years, and I have never written production software in C++. Nope, not ever. Furthermore, I’ve never even tried to write production software in C++, so not only am I not a real C++ developer, I’m not even a wannabe.


    A bit later in that article, I write:


    Fundamentally, I study C++ and its application. I gather as much information as I can about the language and its use (from books, magazines, newsgroups, email and face-to-face conversations with developers and members of the standardization committee, experiments with toy programs I write, etc.), organize and analyze what I find, then I package my findings in concentrated form (e.g., books, magazine articles, technical presentations, etc.) for consumption by people like you—people who do use the language. Your job is to employ C++ as a tool to write useful software. My job is to discover and package the information you need to best apply that tool.

  47. Obviously, this analysis completely leaves out insertion speed, deletion speed, and turn-around time, and whether or not these are important depends on your workload.

    One of the common "use cases" for sets is when you need to process a bunch of things which may have duplicates, but ensure that you only process them once. The turn-around time between insert and lookup is crucial here, so even sorted arrays don't work well in this case.

    But while I'm on the topic, I'd like to point out that every single std::unordered container that I've used in C++ has a performance bottleneck which makes them difficult to use well in some scenarios: the prime hash policy.

    It seems that a lot of people's knowledge about hashing dates from somewhere between the 1970s and the 1990s. Back in the day, the advice was to use a cheap-but-reasonable hash function, then make your hash tables a prime size. On modern pipelined CPUs, this is terrible advice.

    The performance gap between division and all other arithmetic operations has widened considerably; modern multipliers are fully pipelined and can issue and retire one multiply per cycle. Division, on the other hand, is still a high-latency low-throughput operation. This means that the "divide by a prime" operation can be a bottleneck.

    And then there's the advent of SIMD instructions. Today, it's almost as cheap to compute two or even four hash functions simultaneously as it is to compute one. This means that hash techniques which use multiple hash functions (e.g. double hashing, cuckoo hashing) are extremely competitive.
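
    To make the divide-vs-multiply point concrete (a sketch; a real table would also need a decent mixing step before masking, here borrowed from MurmurHash3's 64-bit finalizer):

      #include <cstddef>
      #include <cstdint>

      // Prime-sized table: bucket selection needs an integer division.
      std::size_t bucketPrime(std::size_t hash, std::size_t primeBucketCount)
      {
          return hash % primeBucketCount;   // div: high latency, low throughput
      }

      // Power-of-two-sized table: bucket selection is one AND, but weak
      // hashes must be mixed first or the low bits collide.
      std::size_t bucketPow2(std::uint64_t hash, std::size_t log2BucketCount)
      {
          hash ^= hash >> 33;
          hash *= 0xff51afd7ed558ccdULL;
          hash ^= hash >> 33;
          return static_cast<std::size_t>(hash) &
                 ((std::size_t(1) << log2BucketCount) - 1);
      }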

    What I guess I'm trying to say is that if your set implementation is a bottleneck, you probably should prepare yourself for the fact that the standard containers won't help you.

    Oh, and don't forget Boost's intrusive containers, which certainly help with memory usage.

  48. "Your job is to employ C++ as a tool to write useful software. My job is to discover and package the information you need to best apply that tool."
    First of all, please don't tell me what my job is. Instead, let me tell you what my job really is:
    My real job (the meaty part, what really matters in the end) is to write/maintain real programs, and what you've described as "your" job, I (and most *real* programmers) do during breakfast, lunch and spare free time. You surely are not that arrogant, and you surely don't think that real programmers can't think for themselves, don't do their own research, don't learn new things and don't discover new and important things on their own, without help from you or people like you? Every real programmer does it. But we do it (what you called "your" job), as I've just said, in our **free** **spare** time.

    Now let me tell you what your job is:
    You've invented a job for yourself, a job in which you do for a living what real programmers do during their lunch (and other) breaks.

    Programming is a very practical, hands-on profession, and without practical knowledge of the subject (as you've proven in this article and others), you have no real/deep knowledge of what you're talking about. Not only that, you very often give incorrect and misleading information. Why? Because the sad truth is that you don't have practical knowledge of what you're talking about.

    I'll tell you who you are:
    An amateur, a snake-oil salesman, and good people (especially beginning programmers) should be warned against you.
    I personally have no respect for you as a professional, because you're not a professional.

  49. @Knowing me, knowing you, a-ha: you need to move on to something else. Many of us appreciate and learn from the work Scott does. If you have nothing left to learn, or you can get your knowledge elsewhere, great. Scott: I've been a software developer for decades, and have no shame admitting that my career as a C++ developer was given a significant boost (no pun) when I came across your first book in the mid-1990s. I own your books, have been to 3 of your workshops/presentations, and continue to look forward to learning.

  50. @Unknown: Thanks for your kind words. I'm glad you've found my work helpful.

  51. This comment has been removed by a blog administrator.

  52. This comment has been removed by a blog administrator.

  53. This comment has been removed by a blog administrator.

  54. This comment has been removed by a blog administrator.

  55. This comment has been removed by a blog administrator.

  56. This comment has been removed by a blog administrator.

  57. This comment has been removed by a blog administrator.

  58. This comment has been removed by a blog administrator.

  59. Recent comments on this blog have focused on whether I'm qualified to offer advice on C++ programming and, more generally, on what kinds of qualifications somebody offering such advice should have. Those are legitimate topics for discussion, but they're not related to this blog post. I've therefore removed all comments of that nature except for the first few. (I'm retaining the first few so as not to pretend that the discussions didn't occur. I'm not an enthusiast of revisionist history.) To debate who should be giving advice about C++, please find a different forum.

    I've changed my comment moderation policy so that I must approve new comments before they appear. This means that comments you leave on my blog may not show up for a while. My normal moderation policy is to screen only comments for blog entries that are at least 30 days old. I'll reinstate that policy when this blog entry reaches that point (on October 15).

    I continue to welcome comments about the substance of any of my blog posts, even if the comments are critical.

    As for feedback on my decision to remove some comments pertaining to this blog post and to temporarily impose a stricter moderation policy, I will not permit those to be published, regardless of whether they applaud or condemn my decision. To get a comment posted, you'll need to write about something related to the lookup performance of different containers in the Standard Library.

  60. Scott, as a suggestion for a possible sequel to this interesting installment on container performance, I'd be interested in your take on various vector optimizations and variations. There's array (C-style or std::), there is Hinnant's short_allocator, Boost/Facebook Folly have small_vector (small string optimization for vector, non-conformant) and static_vector (push_back and other things with fixed array memory capacity), and there's the still zombie-like proposal for dynarray.

  61. @Rein Halbersma: Thanks for the suggestion, but I'll let others tackle that topic. It's an area I haven't looked into very much, sorry.

  62. @Scott Meyers I had the same thought as @jensa. Intuitively, running a hash function on two relatively long strings and comparing the hashes should be slower than comparing the two strings lexicographically and looking for the first mismatch. However, it seems this intuition is wrong. I made a slight modification to your program, changing the keys to be random-length strings of up to 1000 characters, with characters drawn randomly from a set of 93 different symbols. The resulting graphs on Clang 3.6 (Intel Core 2 Duo) look similar to yours.

    Gist: https://gist.github.com/hgad/7178e03fcdfaa571f5de
    Graph: https://www.pinterest.com/pin/317574211202963438/

  63. @Haitham Gad: I didn't check your source code, but it seems to me that if you chose truly random strings, most would likely differ in the first one or two characters, so a comparison would not have to look at very much of each string. Strings selected by humans (e.g., names) are likely to be much less random than truly random strings of characters, I think. But that's just speculation on my part.

  64. @Scott Meyers Exactly, but that's a point in favor of sorted containers (which perform lexicographical comparisons). Hash functions would still have to scan the full string to calculate the hash. Yet, unordered containers are still faster according to the graph.

  65. @Haitham Gad
    When doing a lookup in an unordered map, you only hash the key that is being looked up. That hash directs you to the appropriate bucket. Once you are in the correct bucket, op== is called on each collision. You generally don't hash the stored items during a lookup.

    So for a successful lookup, you end up hashing one string that is likely already in the cache, and then you do one full op== operation on the successful item, with some number of unsuccessful op== operations for the collisions.

    Also note that for strings that are too long to hit the small string optimization, you are likely to get even more cache misses when looking at the characters in ordered containers.

  66. @Ben Craig I understand, but in my modification of Scott's code the lookup value is also a random-length string of up to 1000 chars. I understand that both Clang & GCC use a variation of the murmur hash, which operates 4 bytes at a time, so assuming a lookup value 500 chars in length on average, that would be around 125 operations per lookup, ignoring collisions. The largest container Scott's code tests is 100,000 elements, so a sorted container should be able to find the right element in about 16 lexicographical comparisons. Assuming the lexicographical comparison stops at, say, the 4th character on average (assuming keys are sufficiently random), that's about 64 operations. Add to that the fact that sorted container lookups are simple operator<() comparisons, while a hash function operation is more complicated.

    If anything could explain this, it's probably your point that sorted containers have to bring every element into the cache to compare it with the key, while unordered containers operate solely on the key. That could probably explain why they're more efficient in the case of string keys too.

  67. Two particular problems with unordered containers which I have not seen come up in performance discussions are construction time and memory footprint. For (many) relatively short-lived objects aggregating such containers, unordered ones are - in my experience - the least favorable choice. The numbers here suggest universal applicability of unordered sets, although they come with real and hard disadvantages.
