Friday, July 25, 2008

Notes on source code preprocessing

(This post contains some reflections on the hypothetical design of an ideal programming language...)

Don't preprocess source code, and don't design a language so that it needs preprocessing. Preprocessing means that every tool which operates on the source code (editors, compilers, static analyzers, etc.) must either understand the preprocessor language itself or run the preprocessor first and then operate on its output.

Consider, for example, a simple tool that parses C code and searches for a given text in all the string literals found in that code. If the source code is preprocessed first, the tool will miss any string that sits inside a C-preprocessor #ifdef/#endif block whose condition does not hold for that particular run. On the other hand, if we want our tool to also find strings inside these sections, then we can no longer use a plain C parser, because #ifdef and #endif are not part of the C grammar proper.
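To make the problem concrete, here is a minimal sketch of the kind of code such a tool would stumble over. The flag ENABLE_DEBUG_LOG and the function startup_message are made up for illustration; the flag is assumed not to be defined in this build.

```c
/* ENABLE_DEBUG_LOG is a hypothetical build flag; nothing defines it here. */
const char *startup_message(void)
{
#ifdef ENABLE_DEBUG_LOG
    /* The preprocessor drops this branch in the current build, so a tool
       that works on preprocessed output never sees this literal. */
    return "debug build: verbose logging enabled";
#else
    return "release build";
#endif
}
```

A search tool that runs after preprocessing would find only "release build" here. A tool that wants to find both literals has to read the raw text, where the #ifdef, #else and #endif lines are not valid C.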

(A solution might use regular expressions as a kind of heuristic to determine where the strings are, and then perform concatenations, etc. However, this is only an approximate solution, and in general not an entirely satisfying one. What about the more complex task of looking for specific variable or function declarations? In fact, a number of such tools have had to deal with variations of this problem; see for example LXR and Coccinelle.)

Different kinds of preprocessors also don't mix very well. For example, many programming languages have an extension in which the source code is preprocessed so that embedded SQL queries are replaced with the (possibly verbose) support code which would otherwise be needed to prepare and execute the query in question. Now we have exactly the same problem as above: support tools (editors, compilers, etc.) must either understand the preprocessor language or (more likely) always operate on already-preprocessed source code, with all the aforementioned drawbacks. In some cases the order of preprocessing also becomes significant, or different preprocessor languages turn out to be incompatible.

Another practical aspect of preprocessing is that code inside #ifdef blocks is compiled only conditionally; when the condition is false, the compiler never even looks at it. This makes the compiler a lot less useful than it could be. One of the main tasks of the compiler is to inform the programmer when he/she has written something which is internally inconsistent, such as a function call with the wrong argument types. (Such an error can easily go unnoticed if a caller of the function sits inside an #ifdef block and the definition of the function is later changed to take different arguments.)
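The following sketch shows how such an inconsistency can hide. The names scale, compute and LEGACY_PATH are invented for the example, and LEGACY_PATH is assumed not to be defined in the normal build.

```c
/* The definition was changed at some point to take two arguments. */
int scale(int value, int factor)
{
    return value * factor;
}

int compute(int x)
{
#ifdef LEGACY_PATH
    /* Stale call that still uses the old one-argument form. LEGACY_PATH
       (a hypothetical flag) is not defined, so the compiler skips this
       line and reports nothing. */
    return scale(x);
#else
    return scale(x, 2);
#endif
}
```

This compiles cleanly as it stands. The broken call only surfaces when somebody builds with the flag defined (for example by passing -DLEGACY_PATH to the compiler), which may be long after the definition of scale was changed.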

In conclusion, I think C/C++ could have done a lot better in this area. On the other hand, a lot of other programming languages did get it right. But maybe the need for conditional compilation is greater in the lower-level languages, and it's really just a trade-off between performance and usability.
