Don’t put self-documenting code on a pedestal

I don’t look at comments, they’re always out-of-date. Yogi Berra Co-worker

Good coding discipline is important, and we have many rules of thumb to guide us as we design and write or programs. We should also remember that blindly following your heuristics is unwise; there is no silver bullet.

The ends usurped by the means

I’d like to draw your focus towards the notion of self-documenting code. I often hear programmers telling others how hard they try to make their code self-documenting. I think we all know at least one occasion where we or someone we know has, with much pride, got rid of several comments by careful variable naming. But rarely have I heard someone mention making an effort to produce easy to understand code.

Isn’t it the same thing, you may ask? They may be correlated, but you can have one without the other. Poorly-designed software can be full of code which is self-documenting when taken in isolation, but the way modules are combined and the choice of algorithms to accomplish each task can be utterly confusing in a macroscopic view of the program. Sometimes you need to know why the code does what it does, but the self-documenting approach only tells you how. Conversely, literate programs are mostly documentation and can be very easy to understand.

Reach for the stars, fall in the gutter

Most commonly, code is none of these things, and often ripe for improvement. During some code review, we looked at the following block of string processing code which is responsible for grouping together strings recursively by the bytes within each character until it can no longer be subdivided, then processing each group found. (I expanded the example from the original posting date to give some more context here.)

void groupStrings(pair<int, int> currentGroup, bool upperBitsCleared,
                  int currentBits) {
  vector<pair<int, int>> stringGroups;
  // code to populate stringGroups elided
  firstElement = 0;
  if (upperBitsCleared && currentBits == (MAX_BITS - 1)) {
    // this implies we have reached the end of the strings in the first list,
    // which is our base case
    finishProcessing(stringGroups[0]);
    firstElement = 1;
  }
  for (int i = firstElement; i < strings.size (); ++i) {
    groupStrings(stringGroups[i], upperBitsCleared && i == 0, i);
  }

The proponents of self-documenting code among me seized on the condition and associated comment. “Can we make this more readable?” Their suggested outcome was to extract the conditional expression into a method:

void groupStrings(pair<int, int> currentGroup, bool upperBitsCleared,
                  int currentBits) {
  vector<pair<int, int>> stringGroups;
  // code to populate stringGroups elided
  firstElement = 0;
  if (isEndOfString(upperBitsCleared, currentBits, MAX_BITS)) {
    finishProcessing(stringGroups[0]);
    firstElement = 1;
  }
  for (int i = firstElement; i < strings.size (); ++i) {
    groupStrings(stringGroups[i], upperBitsCleared && i == 0, i);
  }

However, the new name is unspecific. Only the first element of the array ‘stringGroups’ contains strings that have no more characters to process; it is not true for the entire array, or even any of the input arguments! What did we really gain from this? According to the teachings of ‘Uncle Bob’, we avoided failure. In his book Clean Code, he says,

The proper use of comments is to compensate for our failure to express yourself in code. Note that I used the word failure. I meant it. Comments are always failures.

The implication in the rest of the chapter is that anything you can do to remove that comment is automatically an improvement, with exceptions (and this isn’t one of his exceptions). However, what happens if you don’t have the ability to accurately judge whether you really expressed that comment in your code?

It’s easy to tell yourself something is obvious and doesn’t need to be explained once you’ve already gained that understanding. Of course we know ‘isTerminal’ relates to the first element, and that this code handles the base case of our recursive algorithm. But how about six months from now? After a dozen bug fixes and feature requests? When code have been added or changed, is it still going to be that obvious?

Another of Uncle Bob’s teachings is the Boy Scout rule: “Always check a module in cleaner than when you checked it out.” The point of the rule is to encourage gradual refactoring of the code into an easier-to-understand form. However, this doesn’t mean any effort is positive: if you vacuum a carpet, but wear muddy boots doing it, only the gnarliest of dust bunnies can justify the damage caused. Even if you believe you’re improving the code, you cannot claim to follow the rule if you lack the perspective to evaluate that improvement.

How odd I can have all this inside me and to you it’s just words

The paraphrase at the top of this essay was said at this code review, and is the reason why I wrote this at all. I concede that nobody forces you to update comments when the code surrounding it changes. Yet comments and identifiers have more in common than you think.

The biggest criticism of comments – that their content does not match reality – is equally true of identifiers. The compiler cannot check the syntax of language used in your identifier name. It cannot warn you when you are storing the end of the range in a variable named ‘start’, nor the width of an object in a field named ‘height’. It cannot tell you when you changed the behaviour of a function, and its name is no longer accurate.

Worse, you are further constricted by the length you can reasonably give an identifier. There’s a reason why nobody writes essays on Twitter: not every concept can be expressed in 140 characters or less.

If one of your axioms is that comments are worse than any code you could write, you are making a mistake. The only alternative to comments is using appropriate variable and function names. It is said that there are two hard problems in computer science, cache invalidation and naming things; it takes a lot of hubris to believe you can name things as well as you can describe them.

Regardless, you must try to name things as well as you can, but be careful. The capacity of a bad name to confuse and mislead the reader is boundless. A bad name unexplained is worse than no name at all.

Excuse me, I believe you have my stapler

This isn’t even considering the benefits of inlining code at its only call site rather than creating an elaborate tree of indirections throughout the source. Unfortunately this is a tough sell, as self-documenting code’s only means of describing blocks of code is through named functions. Take that away and what is left? But even to try and use functions this way is a mistake in my opinion.

Organising everything into small chunks of code works well when your code is perfectly designed and the problem it is trying to solve can be reduced to a series of independent processes. Often, you can ensure the code is reusable, or maps a set of inputs to a set of outputs in an easily-understandable way. It works because the code is simple.

In most real projects, we have to endure incomplete design or unmanaged inherent complexity in the problem domain. For these, it’s a suboptimal strategy unless you are simultaneously removing that complexity. Not all functions extracted from a block of code are fully independent from the code you extracted it from, usually because it expects its inputs to be in a particular state. Extracting that function does not remove complexity from that code. Instead it spreads it around, making it harder to understand why the function does what it does, and allowing other developers to use a function in the wrong context, potentially introducing bugs.

If your goal is to organise the code, commenting does this without removing the spatial context of the code it is complected with. Save functions for eminently reusable code. Functions imply reusability and the movement from one level of abstraction to another. Hijacking them for the purpose of code organisation is a misuse of structure. Structure is important: that’s why the GOTO wars were fought.

Here is an egregrious example from Uncle Bob, where a 10-line function does something more than its name suggests. Rather than documenting its behaviour, it is exploded into an English-language DSL that buries the unique feature of the class not described in its name – replacing a symbol at most once – three levels deep in the call stack. Local variables become class members, and since any method can alter a class member, we have increased the overall complexity of the original code. If trying to document your code makes it more complex, you are probably doing something wrong.

I hope most programmers are not as extreme as him, or at least not influenced too much. It is at least illustrative of the problem that self-documenting code is no replacement for good documentation. In the instance where you are trying to replace a comment, you are merely going from one form of documentation to another – and in a medium where it is harder to express yourself. This is not to say we shouldn’t try to make our code explain itself, but that it is only one of many means to the true goal: to write code that is easily understood and remains so for years to come.

7 thoughts on “Don’t put self-documenting code on a pedestal”

  1. I feel that the example is too ambiguous for me to be able to have an opinion on the refactoring. You’re not giving the reader enough information to decide if this refactoring is a step in the right direction or the wrong direction.

    1. Fair point, and I have made the example more concrete as I’m not trying to set up some straw man here. The point of the example is not to judge if a good refactoring is possible – it is – but that it’s easy to commit to a bad refactoring that loses information, and to question why that would happen.

  2. “we looked at the following block of string processing code which is responsible for grouping together strings recursively by the bytes within each character”

    What property of the ‘bytes within each’ character is used to group them together ?
    In other words, if A and B are two characters and if they belong in the same group if f(A) = B, ( where f is a function ) what is f ?

    “void groupStrings(pair currentGroup, bool upperBitsCleared,
    int currentBits) {
    vector<pair> stringGroups;
    // code to populate stringGroups elided”

    I don’t get it. You are passing in currentGroup ( pair of ints ) which is ignored and then populating a new vector of pair ints after each recursive call ? Are you sure this code represents the actual code faithfully ? What EXACTLY was the actual code doing ?

    1. ‘f’ is a parameter to the algorithm, which is also equivalent to transforming the input strings before executing this function. I see why you are asking, but general-purpose library code doesn’t always get the luxury of concrete naming. For example, the algorithm is used to perform deduplication in one module and sorting in another, so the function can’t reasonably use concretions from different concepts.

      The only thing I have hidden from you is that currentGroup is used to populate stringGroups in the elided code. The pairs of ints define ranges of indices for a separate array of strings where it has been determined the strings are equivalent, or can be subdivided further.

    1. That doesn’t seem very recent 🙂 but nonetheless a good read!

      “Keeping documentation close to the code and close to the API” is very significant to me. If you work on a project with public documentation, when was the last time you looked at that documentation? Just the other day I found an instance where the behaviour of a feature diverged from the documentation quite some time ago. Open source software is notorious for this. I don’t have a solution, just lamenting the problem of communicating developer changes through to user-facing documentation.

      1. That doesn’t seem very recent

        Fast times we live in when 2 years are an epoch and not recent 😉

        If you work on a project with public documentation, when was the last time you looked at that documentation?

        All the time. I expect APIs not to change, so if OSS (or commercial products alike) are well-written, this generally shouldn’t be a problem.

        In our case (jOOQ), we know how important documentation is to our users and customers. They don’t want to dig into the sources all the times, they want to be able to google things. But sometimes (rarely), having docs go out of sync with the actual implementation is inevitable, unfortunately. One thing’s for sure: No docs is not the solution.

Leave a Reply

Your email address will not be published. Required fields are marked *