Tuesday, September 15, 2015

Effective Modern C++ in Polish!

The family of Effective Modern C++ translations continues to grow. The latest member is in Polish.

As with all the translations I've seen so far, the Polish edition uses only one ink color (black). I therefore believe that if you're comfortable with technical English, you'll probably prefer the English (American) edition. If you prefer your C++ in Polish (including code comments!), however, I'm pleased to report that you now have that option.


Should you be using something instead of what you should use instead?

The April 2000 C++ Report included an important article by Matt Austern: "Why You Shouldn't Use set—and What to Use Instead." It explained why lookup-heavy applications typically get better performance from applying binary search to a sorted std::vector than they'd get from the binary search tree implementation of a std::set. Austern reported that in a test he performed on a Pentium III using containers of a million doubles, a million lookups in a std::set took nearly twice as long as in a sorted std::vector.

Austern also reported that the std::set used nearly three times as much memory as the std::vector. On Pentium (a 32-bit architecture), doubles are 8 bytes in size, so I'd expect a std::vector storing a million of them to require about 7.6MB. Storing the same million doubles in a std::set would call for allocation of a million nodes in the underlying binary tree, and assuming three 4-byte pointers per node (pointer to left child, pointer to right child, pointer to parent), each node would take 20 bytes. A million of them would thus require about 19MB. That's only 2.5 times the memory required for the std::vector, but in 2000, I believe it was fairly common for each dynamic memory allocation to tack on a word indicating the size of the allocated storage, and if that was the case in Austern's test, each tree node would require 24 bytes—precisely three times the memory needed to store a single double in a std::vector.

The difference in memory utilization explains why searching a sorted std::vector can be faster than searching a set::set holding the same data. In a std::vector, the per-element data structure overhead present in a search tree is missing, so more container elements fit onto a memory page or into a cache line. We'd thus expect fewer page faults and/or cache misses when looking things up in a std::vector, and faster memory accesses means faster lookups.

Many people were influenced by Austern's article and by independent experiments bolstering his conclusions. I was among them. Item 23 of Effective STL is "Consider replacing associative containers with sorted vectors." Boost was convinced, too: Boost.Container offers the flat_(multi)map/set associative containers, citing as inspiration both Austern's article and the discussion of the sorted std::vector-based AssocVector in Andrei Alexandrescu's Modern C++ Design. (In his book, Alexandrescu references the C++ Report article. All roads in this area lead to Matt Austern.)

There's nothing in the article about containers based on hash tables (i.e., the standard unordered containers), presumably because there were no hash-table-based containers in the C++ standard library in 2000. Nevertheless, the same basic reasoning would seem to apply. Hash tables are based on nodes, and node-based containers incur overhead for pointers and dynamic allocations that std::vectors avoid.

On the other hand, hash tables offer O(1) lookup complexity, while sorted std::vectors offer only O(lg n). This suggests that there should be a point at which the more conservative memory usage of a std::vector is compensated for by the superior computational complexity of an unordered container.

After a recent presentation I gave discussing these issues, Paul Beerkens showed me data he'd collected showing that on simple tests, the unordered containers (i.e., hash tables) essentially always outperformed sorted std::vectors for container sizes beyond about 50. I was surprised that the crossover point was so low, so I did some testing of my own. My results were consistent with his. Here's the data I got for containers between size 20 and 1000; the X-axis is container size, the Y-axis is average lookup time:
The lines snaking across the bottom of the graph (click the image to see a larger version) are for the unordered containers. The other lines are for the tree-based containers (std::set and std::map) and for Boost's flat containers (i.e., sorted std::vectors). For containers in this size range, the superiority of the hash tables is clear, but the advantage of the flat containers over their tree-based counterparts isn't. (They actually tend to be a bit slower.) If we bump the maximum container size up to 20,000, that changes:
Here it's easy to see that for containers with about 5000 elements or more, lookups in the flat containers are faster than those in the binary trees (though still slower than those in the hash tables), and that remains true for containers up to 10 million elements (the largest I tested):
For very small containers (no more than 100 elements), Beerkens added a different strategy to the mix: linear search through an unsorted std::vector. He found that this O(n) approach performed better than everything else for container sizes up to about 35 elements, a result that's consistent with the conventional wisdom that, for small data sets, linear search runs faster than more complicated hash-based and binary-search-based algorithms.

The graphs above correspond to tests I ran on an Intel Core i7-820 using GCC 5.1.0 and MinGW under 64-bit Windows 7. Optimization was set to -O3. I ran the same tests using Visual C++ 2015's compiler, but for reasons I have yet to investigate, all the timings for the hash tables were zero under that compiler.  I've therefore omitted data for VC++. Interestingly, the code to perform linear searches through unsorted std::vectors took zero time under GCC, though non-zero time under VC++. This is why I show no data comparing lookup speeds for all of  binary search trees, hash tables, sorted std::vectors, and unsorted std::vectors: neither GCC nor VC++ generated non-zero lookup times for all approaches.

Maybe GCC optimized out the loop doing the lookups in the unsorted std::vectors, while VC++ optimized away the loops doing the lookups in the hash tables. Or maybe my test code is flawed. I don't know. Perhaps you can figure it out: here's the test code. (It's based on code Paul Beerkens shared with me, but I made substantial changes, so if there's something wrong with the test code, it's my fault.) Feel free to play around with it. Let me know what you find out, either about the code or about the results of running it.

If Paul Beerkens' and my results are valid and generalize across hardware, compilers, standard library implementations, and types of data being looked up, I'd conclude that the unordered standard associative containers (i.e., hash tables) should typically be the ones to reach for when you need high-performance lookups in an associative container and element ordering is not important. I'd also conclude that for very small containers, unsorted std::vectors are likely to be the way to go.

As for the title of this blog post, it looks like what you should use instead of using std::set (and std::map and their multi cousins) these days is probably a truly "unordered" container: a hash table or, for small containers, an unsorted std::vector.



Jonathan Wakely posted source code showing how to replace my Windows-specific timing code with portable standards-conformant C++, and Tomasz Kamiński took my code incorporating it and sent a revised program that (1) uses the modify-a-volatile trick to prevent compilers from optimizing loops away and (2) checks the result of the lookup against the container's end iterator (because, in practice, you'd always need to do that). When I run Kamiński's code, I get non-zero lookup times for all containers for both GCC and VC++.

Here are the results I got for containers of up to 100 elements with GCC:
And here they are for VC++:
With both compilers, linear search through an unsorted std::vector is fastest for very small containers, but it doesn't take more than about 20-50 elements for the hash tables to take the lead.Whether that remains the case across hardware, compilers, standard library implementations, and types of data being looked up, I don't know. (That caveat is present in the original post above, but some commenters below seem to have overlooked it.)

Thursday, September 10, 2015

Interview with me on CppCast

I've been a loyal listener to CppCast since its launch earlier this year, so I was pleased to be asked to be a guest on the show. The result is now live. Among other things, hosts Rob Irving and Jason Turner asked me about my recent blog post on inconsistencies in C++ intialization syntax (their idea, not mine), my role as Consulting Editor for the Effective Software Development Series, common misconceptions C++ developers have about the workings of the language, Items from my books I consider especially noteworthy, advice to would-be authors and presenters, aspects of C++ I'd prefer didn't exist, and how I found myself lecturing about C++ from a nightclub stage normally used for belly dancing.

It was a fun interview for me, and I hope you enjoy listening to it.


Monday, September 7, 2015

Thoughts on the Vagaries of C++ Initialization

If I want to define a local int variable, there are four ways to do it:
int x1 = 0;
int x2(0);
int x3 = {0};
int x4{0};
Each syntactic form has an official name:
int x1 = 0;              // copy initialization
int x2(0);               // direct initialization
int x3 = {0};            // copy list initialization
int x4{0};               // direct list initialization
Don't be misled by the word "copy" in the official nomenclature. Copy forms might perform moves (for types more complicated than int), and in practice, implementations often elide both copy and move operations in initializations using the "copy" syntactic forms.

(If you engage in written communication with a language lawyer about these matters and said lawyer has its pedantic bit set, you'll be reprimanded for hyphen elision. I speak from experience. The official terms are "copy-initialization," "direct-initialization," "copy-list-initialization," and "direct-list-initialization." When dealing with language lawyers in pedantic mode, it's wise to don a hazmat suit or to switch to oral communication.)

But my interest here isn't terminology, it's language design.

Question #1: Is it good language design to have four ways to say the same thing?

Let's suppose that instead of wanting to define an int, we want to define a std::atomic<int>. std::atomics don't support copy initialization (the copy constructor is deleted), so that syntactic form becomes invalid. Copy list initialization continues to succeed, however, because for std::atomic, it's treated more or less like direct initialization, which remains acceptable. So:
std::atomic<int> x5 = 0;    // error!
std::atomic<int> x6(0);     // fine
std::atomic<int> x7 = {0};  // fine
std::atomic<int> x8{0};     // fine
(I frankly expected copy list initialization to be treated like copy initialization, but GCC and Clang thought otherwise, and [over.match.list] in C++14 backs them up. Live and learn.)

Question #2: Is it good language design to have one of the four syntaxes for defining an int be invalid for defining a std::atomic<int>?

Now let's suppose we prefer to use auto for our variable instead of specifying the type explicitly. All four initialization syntaxes compile, but two yield std::initializer_list<int> variables instead of ints:
auto x9 = 0;                // x9's type is int
auto x10(0);                // x10's type is int
auto x11 = {0};             // x11's type is std::initializer_list<int>
auto x12{0};                // x12's type is std::initializer_list<int>
This would be the logical place for me to pose a third question, namely, whether these type deductions represent good language design. The question is moot; it's widely agreed that they don't. Since C++11's introduction of auto variables and "uniform" braced initialization syntax, it's been a common error for people to accidentally define a std::initializer_list when they meant to define, e.g., an int.

The Standardization Committee acknowledged the problem by adopting N3922 into draft C++17. N3922 specifies that an auto variable, when coupled with direct list initialization syntax and exactly one value inside the braces, no longer yields a std::initializer_list. Instead, it does what essentially every programmer originally expected it to do: define a variable with the type of the value inside the braces. However, N3922 leaves the auto type deduction rules unchanged when copy list initialization is used. Hence, under N3922:
auto x9 = 0;                // x9's type is int
auto x10(0);                // x10's type is int
auto x11 = {0};             // x11's type is std::initializer_list<int>
auto x12{0};                // x12's type is int
Several compilers have implemented N3922. In fact, it can be hard—maybe even impossible— to get such compilers to adhere to the C++14 standard, even if you want them to. GCC 5.1 follows the N3922 rule even when expressly in C++11 or C++14 modes, i.e., when compiled with -std=c++11 or -std=c++14. Visual C++ 2015 is similar: type deduction is performed in accord with N3922, even when /Za ("disable language extensions") is used.

 Question #3: Is it good language design for copy list initialization (i.e., braces plus "=") to be treated differently from direct list initialization (i.e., braces without "=") when deducing the type of auto variables?

Note that these questions are not about why C++ has the rules it has. They're about whether the rules represent good programming language design. If we were designing C++ from scratch, would we come up with the following?
int x1 = 0;                 // fine
int x2(0);                  // fine
int x3 = {0};               // fine
int x4{0};                  // fine
std::atomic<int> x5 = 0;    // error!
std::atomic<int> x6(0);     // fine
std::atomic<int> x7 = {0};  // fine
std::atomic<int> x8{0};     // fine
auto x9 = 0;                // x9's type is int
auto x10(0);                // x10's type is int
auto x11 = {0};             // x11's type is std::initializer_list<int>
auto x12{0};                // x12's type is int
Here's my view:
  • Question #1: Having four ways to say one thing constitutes bad design. I understand why C++ is the way it is (primarily backward-compatibility considerations with respect to C or C++98), but four ways to express one idea leads to confusion and, as we've seen, inconsistency.
  • Question #2: Removing copy initialization from the valid initialization syntaxes makes things worse, because it introduces a seemingly gratuitous inconsistency between ints and std::atomic<int>s.
  • Non-question #3: I thought the C++11 rule about deducing std::initializer_lists from braced initializers was crazy from the day I learned about it. The more times I got bitten by it in practice, the crazier I thought it was. I have a lot of bite marks.
  • Question #3: N3922 takes the craziness of C++11 and escalates it to insanity by eliminating only one of two syntaxes that nearly always flummox developers. It thus replaces one source of programmer confusion (auto + braces yields counterintuitive type deduction) with an even more confusing source (auto + braces sometimes yields counterintuitive type dedeuction). One of my earlier blog posts referred to N2640, where deducing a std::initializer_list for auto variables was deemed "desirable," but no explanation was offered as to why it's desirable. I think that much would be gained and little would be lost by abandoning the special treatment of braced initializers for auto variables. For example, doing that would reduce the number of sets of type deduction rules in C++ from five to four.
But maybe it's just me. What do you think about the vagaries of C++ initialization?