Tag Archives: type system

Clarity and bug prevention using phantom types

It’s inevitable that you will use similar data types and structures to represent different concepts in your code. For example, a string might represent raw data from an HTTP request, but it might also represent data which has been normalised and encoded for display in a web browser.

Often the context is clear as to what concept the types map to – the function and parameter names tell you. But what happens when multiple concepts map to the same basic data types? In a slowly-changing code base, how do you make sure you don’t let data that means one thing from being treated as if it means something else?

Let’s take a trivial example in C++. Imagine trying to get a row of records from a large table, but you only want data from a subset of the fields in that table. Here’s a likely class design for the table:

class Table {
  // various details elided
  vector<Value> getRow(size_t row, const vector<size_t> &indexes);
};

Insert row number, and for each column you specify by index, you get the value corresponding to that column. Simple enough; over the next few weeks you write a bunch of code which refers to the columns by index.

Later, the table/column design is extended so that columns correspond to a distinct field; that way, tables containing the same field can share data. Now in many places it’s easier to access the fields by their unique identifier instead of needing a table/column pair. Your Table class gets updated:

class Table {
  // various details elided
  vector<Value> getRowFromIndexes(size_t row,
                                  const vector<size_t> &indexes);
  vector<Value> getRowFromFields(size_t row,
                                 const vector<size_t> &fieldIDs);
};

Next, you change a function which used to use column indexes and instead use field identifiers. Did you remember to update all of the getRowFromIndexes calls to getRowFromFields? What about in functions called by the function you just changed? The compiler isn’t going to tell you if you forgot – there are only two differences between getRowFromIndexes and getRowFromFields: a name and an implementation, and the compiler only cares that the name you picked exists. Congratuations on your new potential bug.

Even worse, if your field identifiers are in the same kind of range as your field indexes, you might not even get warned about it by array bounds checking or segfaults, silently returning bad data. I speak from experience; this is not a fun thing to deal with.

The ideal solution to this is to automate the process of verifying that the semantics of data passed to a function match the semantics of the data the function expects to receive. If that doesn’t seem familiar to you, ask yourself something similar:

Can you automate the process of verifying that the structure of data passed to a function matches the structure of the data the function expects to receive?

The answer is yes – the compiler makes sure you can’t pass an int to a function that expects a string, nor will it let you store an instance of the SquarePeg class in a value of RoundHole unless there is a valid constructor or conversion operator for that. All we need is a simple way to embed semantics in structure, and that’s exactly what phantom types do.

Phantom types are merely a way to annotate a data type in such a way that the type system considers it distinct from the original data type, and for other instances of that data type that have different annotations. In our Table class, we can make the compiler prevent column indexes and field identifiers being passed to the wrong functions:

class Table {
  // various details elided
  vector<Value> getRowFromIndexes(size_t row,
                                  const vector<ColumnIndex> &indexes);
  vector<Value> getRowFromFields(size_t row,
                                 const vector<FieldIdentifier> &fieldIDs);
};

In fact, since these functions are now different types, we can even overload them safely:

class Table {
  // various details elided
  vector<Value> getRow(size_t row, const vector<ColumnIndex> &indexes);
  vector<Value> getRow(size_t row, const vector<FieldIdentifier> &fieldIDs);
};

And now we don’t even have to care what data type we have; the correct implementation of getRow will be used whether we have references to columns or to fields.

The question remains, how do you generate these phantom types? Languages with better-developed type systems like Haskell and Rust make it almost trivial, but you can still do it in C++ with a little work. It’s not as simple as creating a new class which inherits from the class you want to parameterise (which in any case precludes you from using fundamental types) as you need to obey a few rules:

  1. You cannot create a phantom type directly from another phantom type with the same type parameters. Hopefully very obvious, otherwise the types are not policy but merely comments.
  2. You may only create the phantom type explicitly. In the example, you shouldn’t be able to pass any old integer to a function expecting a ColumnIndex; there should be only a few approved sources of ColumnIndexes which sanitise their inputs.
  3. You may only get the underlying value in the phantom type explicitly. This is effectively the reverse of the above rule. Most function accepting ints are not appropriate for a ColumnIndex, and without this restriction you can accidentally circumvent rule 1. Unpack it at the lowest level at which you really need the value and you will minimise the risk of misusing the value.

While it sounds restrictive, it does not preclude you from using two instances of a phantom type with each other. There is no problem concatenating two UnsafeStrings or two SafeStrings, but concatenating an UnsafeString with a SafeString should be considered a mistake by default. Similarly, member access of a value in a phantom type is also fine.

If you want to try this out for yourself, there is a good discussion on Programmers Stack Exchange how you might actually achieve this in C++. The generic wrapper which lets you use operators with two instances of the same phantom type can be found in this paste and makes heavy use of SFINAE, so be brave!