Tuesday, November 17, 2015

The Brick Wall of C++ Source Code Transformation

In 1992, I was responsible for organizing the Advanced Topics Workshop that accompanied the USENIX C++ Technical Conference. The call for workshop participation said:
The focus of this year's workshop will be support for C++ software development tools. Many people are beginning to experiment with the idea of having such tools work off a data structure that represents parsed C++, leaving the parsing task to a single specialized tool that generates the data structure. 
As the workshop approached, I envisioned great progress in source code analysis and transformation tools for C++. Better lints, deep architectural analysis tools, automatic code improvement utilities--all these things would soon be reality! I was very excited.

By the end of the day, my mood was different. Regardless of how we approached the problem of automated code comprehension, we ran into the same problem: the preprocessor. For tools to understand the semantics of source code, they had to examine the code after preprocessing, but to produce acceptable transformed source code, they had to modify what programmers work on: files with macros unexpanded and preprocessor directives intact. That means tools had to map from preprocessed source files back to unpreprocessed source files. That's challenging even at first glance, but when you look closer, the problem gets harder. I found out that some systems #include a header file, modify preprocessor symbols it uses, then #include the header again--possibly multiple times. Imagine back-mapping from preprocessed source files to unpreprocessed source files in such systems!

Dealing with real C++ source code means dealing with real uses of the preprocessor, and at that workshop nearly a quarter century ago, I learned that real uses of the preprocessor doomed most tools before they got off the drawing board. It was a sobering experience.

In the ensuing 23 years, little has changed. Tools that transform C++ source code still have to deal with the realities of the preprocessor, and that's still difficult. In my last blog post, I proposed that the C++ Standardization Committee take into account how source-to-source transformation tools could reduce the cost of migrating old code to new standards, thus permitting the Committee to be more aggressive about adopting breaking changes to the language. In this post, I simply want to acknowledge that preprocessor macros make the development of such tools harder than my last post implied.

Consider this very simple C++:
#define ZERO 0

auto x = ZERO;
int *p = ZERO;
In the initialization of x, ZERO means the int 0. In the initialization of p, ZERO means the null pointer. What should a source code transformation tool do with this code if its job is to replace all uses of 0 as the null pointer with nullptr? It can't change the definition of ZERO to nullptr, because that would change the semantics of the initialization of x. It could, I suppose, get rid of the macro ZERO and replace all uses with either the int 0 or nullptr, depending on context, but (1) that's really outside its purview (programmers should be the ones to determine if macros should be part of the source code, not tools whose job it is to nullptr-ify a code base), and (2) ZERO could be used inside other macros that are used inside other macros that are used inside other macros..., and especially in such cases, reducing the macro nesting could fill the transformed source code with redundancies and make it harder to maintain. (It'd be the moral equivalent of replacing all calls to inline functions with the bodies of those functions.)

I don't recall a lot of talk about templates at the workshop in 1992. At that time, few people had experience with them. (The first compiler to support them, cfront 3.0, was released in 1991.) Nevertheless, templates can give rise to the same kinds of problems as the preprocessor:
template<typename T>
void setToZero(T& obj) { obj = 0; }

int x;
setToZero(x);    // "0" in setToZero means the int

int *p;
setToZero(p);    // "0" in setToZero means the null pointer
I was curious about what clang-tidy did in these situations (one of its checks is modernize-use-nullptr), but I was unable to find a way to enable that check in the version of clang-tidy I downloaded (LLVM version 3.7.0svn-r234109). Not that it matters. The way that clang-tidy approaches the problem isn't the only way, and one of the reasons I propose a decade-long time frame to go from putting a language feature on a hit list to actually getting rid of it is that it's likely to take significant time to develop source-to-source translation tools that can handle production C++ code, macros and templates and all.

The fact that the problem is hard doesn't mean it's insurmountable. The existence of refactoring tools like Clang-tidy (far from the only example of such tools) demonstrates that industrial-strength C++ source transformation tools can be developed. It's nonetheless worth noting that such tools have to take the existence of templates and the preprocessor into account, and those are noteworthy complicating factors.

-- UPDATE --

A number of comments on this post include references to tools that chip away at the problems I describe here. I encourage you to pursue those references. As I said, the problem is hard, not insurmountable.


Unknown said...

How do you propose to deal with ifdef aka conditionally activated code. The classic example being compiler/os/(debug|release) specific code.

In theory you want ALL the code transformed. Including those depending on header or language extension not available on the current platform, while still making sure all the code path remain valid.
In practice, I don't see how a code transformer could understand those semantics

Let's imagine

#ifndef CPP11
#define NULL 0
#define NULL nullptr

What do you do then ?

Rein Halbersma said...

I don't think the situation is all that bleak. With an extra level of indirection, the code can be properly transformed and diagnosed. E.g. Kumar, Sutton and Stroustrup (http://www.stroustrup.com/icsm-2012-demacro.pdf) discuss how to automatically transform macros to C++11 declarations.

In the case of your first example, this would transform to a `constexpr auto ZERO = 0;` After this transformation (or the equivalent `enum { ZERO = 0 };`, both gcc and clang give appropriate warnings about assigning constants to pointers.

For your second example, the template definition should also wrap the magic constant 0 into a variable and then again, both gcc and clang will warn about it.

Rein Halbersma said...

I think a general lesson is that literals (either generated by macros or hand-written as in your template definition) are very dangerous in a generic context. A much safer definition for your template would be to do `obj = T{};` and then p will be set to nullptr. Compilers should have no trouble discovering literals in generic contexts and warn about them.

Andy Krouwel said...

I've been wondering if a tool that just throws 'code too clever/obtuse error' when it encounters situations like this might not be valuable in its own right. It doesn't have to fix it, just highlight the situation.

C/C++ has all kinds of nutty variant ways of doing things and flagging them up is actually very useful as clever/obtuse code is more likely to contain bugs and be a general maintenance nightmare.

Unknown said...

For an implementation of that macro replacement download and try Cevelop from htttp://cevelop.com

Matt said...
This comment has been removed by the author.
Matt said...

The C-Reduce group has run into similar issues -- and made some progress:

Scott Meyers said...

@Rein Halbersma: Thanks for the reference to the ICSM paper. Between that paper and the papers it references, it's clear that there's been a fair amount of work on classifying uses of the preprocessor and trying to deal with them.

Regarding your comment that both gcc and clang issue warnings about using a zero-valued constant to initialize a pointer, the latest MSVC issues no warning, not even with /Wall. This is emblematic of the problem with "compilers can issue warnings" thinking about dealing with "bad" C++: it's dependent on cooperation by all compiler vendors.

Scott Meyers said...

@Corentin Jabot: Conditional compilation for different versions of the language is a tough nut to crack, no doubt about it. In the ICSM paper Rein Halbersma referenced in his comment, conditional compilation is one of the preprocessor uses the researchers didn't even try to deal with.

I don't have a proposal for how to deal with this problem, but I suspect that there are people working on it, and I have faith that progress will be made.

Scott Meyers said...

@Andy Krouwel: Static analysis tools that flag code that's problematic in some way (e.g., too complicated) have been around for a long time, and I agree, such tools have value.

Rein Halbersma said...

@Scott Warnings from 2 out of 3 major compilers (haven't got access to Intel) isn't too bad, and with the appropriate bug reports, such QoI issues should get resolved sooner rather than later.

Unknown said...

@Scott The "use nullptr" transformation hasn't been ported from clang-modernize to clang-tidy yet. If you run clang-modernize -use-nullptr on:

#define ZERO 0
auto x = ZERO;
int *p = ZERO;
int y = 0;
int *q = 0;

... it transforms the last line to "int *q = nullptr;" and leaves the rest alone.

Scott Meyers said...

@Unknown: Thanks for explaining that clang-tidy doesn't yet have the checks from clang-modernize. I get the same results as you when I run clang-modernize.

Daniel S. Wilkerson said...

This is a really hard problem, but Taras Glek at Mozilla managed to take our C/C++ front end, Elsa, and, using a preprocessor that embedded special comments, implement a source to source transformation that worked and that they used to transform the huge Mozilla codebase, as Taras reported in CodeCon 2009.

Elsa is maintained together with Oink which is a collection of C++ tools which use Elsa as their front-end; you can find Elsa and OInk here: https://github.com/dsw/oink-stack ; the Oink tool that does this is xform. (Oink provides other tools as well, such as a static dataflow analysis that was used to find many bugs in Debian: http://http.cs.berkeley.edu/~daw/papers/fmtstr-plas07.ps .)

Note that Elsa is by now a bit behind the latest C++11 and C++14. However the hard part of bringing Elsa up to date amounts to adding the new language features to the parser. The Elsa C/C++ grammar is implemented in Elkhound which is a GLR parser generator, so many shift-reduce and reduce-reduce parsing conflicts simply go away, so doing this should be straightforward.

Scott Meyers said...

@Daniel S. Wilkerson: Thanks for this information. I used to follow Taras' blog posts, so I've heard of Elsa and Oink, though I've never used them.

Do you have any sense for how difficult it is to create static analysis and source code transformation tools atop Elsa/Oink instead of clang? The more toolkits we have for developing sophisticated C++ tools, the better for the community!

Daniel S. Wilkerson said...

@Scott Meyers: Well I wrote the Oink dataflow analysis, but I have not written an analysis on C++ using Clang/LLVM, so I do not know that half of the answer.

I provide many example analyses in Oink for writing various analyses: dataflow (my flagship Oink analysis), statement-granularity control flow (leveraging Scott's control flow analysis in Elsa), source-to-source xform (using Taras's code), etc., so I made it as easy as I could for people to get started writing analyses on Elsa/Oink; for example, to do a dataflow analysis all you should need to do is to take our existing dataflow analysis and add client code to propagate whatever information you want along the dataflow edges.

Our approach to building a tool is quite "deconstructed", by which I mean that each layer/module/aspect we added to the system deliberately did just one thing. So for example, in Elsa you can stop after parsing but before disambiguation and print out the ambiguous "parse forest" as Scott McPeak called it, in case you ever wanted to actually see the C++ ambiguities in all their glory. We never designed it for source-to-source transformation, but since we did such a deconstructed design, Taras was able to do it (I told him I didn't think it would be possible :-) ). Taras said it was the only C/C++ front-end out there on which he could do what he wanted.

Some of the more esoteric parts of C++ were never finished: I don't think I finished adding dataflow edges for multiple inheritance, for example; I think templates are close to being complete, but I don't know if we ever did some of the esoteric cases there. Further, I should say that right now Elsa/Oink is in "suspended animation." That is, none of us are actively using it or improving it, however it is not abandoned: I do attempt to answer questions and when people send me bug fixes I do attempt to coordinate with them in including them into the codebase. If someone got enthusiastic about working on it, I would be glad to assist them.

Daniel S. Wilkerson said...

I should say the GLR parsing algorithm swallows many local ambiguities; printing the parse forest will print those ambiguities which remain after parsing is finished and which therefore require typechecking to resolve.

Anonymous said...

Surely the preprocessor is, as you recently argued for C++ itself, due for some breaking changes to make analysis and transformation of source code practical?

A simplified preprocessor, compatible with everyday use cases, but not supporting pathological uses, seems quite feasible, e.g.:

* Make #defines immutable (no assiging a new value with a second #define)
* Everything gets included as if it were #pragma once (no re-including of files again in a new context)

Another rule (which requires a compiler to enforce) is each #include file must compile as a standalone translation unit e.g., if you have foo.h, then you must be able to copy it to foo.cpp, and compile it without getting an error. This rules out constructs that span the multiple files like this:

#include "if-and-open-curly.h"

(Additionally, it proves that the header is self-sufficient, and doesn't depend on earlier #includes or #defines from the including file).

Scott Meyers said...

@Anonymous: Breaking changes to the preprocessor would certainly be possible, but that's really in the domain of my previous post. This post focuses on one of the challenges facing people who want to write automatic source code modification tools to mitigate the effect of breaking changes, and as such assumes that the language to be transformed is the current one, including the current preprocessor.

Unknown said...

we need to deprecate the preprocessor it's an abomination, there are already language structures to replace many uses of macro's etc like inline functions generics/templates but we need alternatives to #ifdef and co and other such preprocessor magic also modules will get rid of the need for #include etc.

I think it will take more like 20 years to eliminate the preprocesor though but meanwhile source code transformation tools could take us a long way

Unknown said...

This blog awesome and i learn a lot about programming from here.The best thing about this blog is that you doing from beginning to experts level.

Love from