One of the key criteria for the long-term success of any computer system would seem to be stability. For instance, if an app crashes your phone, you will use it less or even uninstall it. While crashes are frustrating, it is quite amazing that computer systems have any stability at all given their complexity. Compared to most systems humans have previously created, computers are by far the most complex, especially once you factor in the software that runs on them.
From that perspective it is something of a miracle that our computers, phones, and networks are stable at all. That stability is achieved through a variety of design strategies, such as layering and state localization. As I am writing this on my computer and posting it via Tumblr (still), it is incredible just how many different layers are involved. On my laptop there is the hardware, an OS kernel, OS user space, and applications. On the network, the physical, data link, network, transport, and application layers are all separated. Layering achieves stability because each layer only needs to provide a limited set of assurances. As long as those are provided, the higher-level layers will continue to operate.
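As a rough illustration of that last point (a hypothetical sketch, not how any particular stack is implemented), layering can be expressed as programming against interfaces: each layer depends only on the assurance the layer below provides, so the lower layer's implementation can change, or partially fail and recover, without the layers above noticing.

```python
# Hypothetical sketch: a higher layer written only against the
# assurance (interface) of the layer below, never its implementation.
from abc import ABC, abstractmethod

class Transport(ABC):
    """Assurance: deliver bytes, or raise if delivery is impossible."""
    @abstractmethod
    def send(self, data: bytes) -> None: ...

class SimpleLink(Transport):
    """One concrete lower layer; could be swapped for any other."""
    def __init__(self) -> None:
        self.delivered: list[bytes] = []
    def send(self, data: bytes) -> None:
        self.delivered.append(data)  # pretend delivery over the wire

class Application:
    """Higher layer: only ever sees the Transport assurance."""
    def __init__(self, transport: Transport) -> None:
        self.transport = transport
    def post(self, message: str) -> None:
        self.transport.send(message.encode())

app = Application(SimpleLink())
app.post("hello")
```

Swapping `SimpleLink` for a different `Transport` leaves `Application` untouched, which is the whole trick: the upper layer's stability rests on the assurance, not the implementation.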
Generally, within each layer there are many components that can fail independently without bringing down the entire layer. For instance, in the network, individual routers can fail. Localization of state means that if my laptop crashes, it (generally) doesn't also bring down the bank computer I was connected to and thus take away your ability to access your bank account. The bank computer doesn't sit blocked waiting on my computer, which would hold up everyone else.
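A toy way to see why the bank's server doesn't hang on any one client (a hypothetical sketch; a real server involves sockets, sessions, and per-connection state) is to handle each client independently under a deadline, so one unresponsive client affects only its own request:

```python
# Hypothetical sketch: per-client handling with a deadline, so one
# hung client does not block everyone else's requests.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def handle(client_delay: float) -> str:
    time.sleep(client_delay)      # simulate the client's round trip
    return "ok"

pool = ThreadPoolExecutor()
hung = pool.submit(handle, 2.0)   # a client that has effectively crashed
fine = pool.submit(handle, 0.01)  # a normal client

results = {}
for name, fut in [("hung", hung), ("fine", fine)]:
    try:
        results[name] = fut.result(timeout=0.5)  # per-client deadline
    except TimeoutError:
        results[name] = "timed out"  # only this client is affected
```

The failed interaction is contained in its own slot of `results`; the state of every other client's session is untouched.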
Achieving that kind of stability will be an interesting challenge for many of the emerging decentralized computation platforms such as Ethereum. The interactions between different programs (smart contracts) are complex and difficult to predict or understand. We are also building many of the layers simultaneously rather than sequentially, such as the consensus protocols and the computation on top of them.
These problems are already hard but will get especially difficult when we attempt to scale these systems to increase their throughput. Scaling raises thorny questions about localizing state while still maintaining a consistent global ledger. There are also interesting new failure modes, such as the possibility of an error in the cost estimation for running calculations.
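To make that last failure mode concrete (a deliberately simplified sketch with made-up numbers, not Ethereum's actual gas schedule): metered execution charges a fixed price per operation and halts when the budget runs out, so if the price table underestimates how expensive an operation really is, the same budget buys far more real work than was paid for.

```python
# Simplified sketch of gas metering. COST is a hypothetical price
# table; the numbers are invented for illustration.
COST = {"add": 1, "store": 5}

def run(program: list[str], gas: int) -> tuple[int, int]:
    """Execute ops until done or out of gas; return (work done, gas left)."""
    work = 0
    for op in program:
        price = COST[op]
        if gas < price:
            raise RuntimeError("out of gas")
        gas -= price
        # Suppose "store" actually performs 100 units of real work.
        # If COST["store"] were mispriced at 1, the same gas budget
        # would buy ~100x the intended work: the cost-estimation bug.
        work += 100 if op == "store" else 1
    return work, gas

work, left = run(["add", "store", "add"], gas=10)
```

The danger is that the price table is part of the protocol itself: a mispriced operation becomes a system-wide invitation to cheap denial-of-service rather than a bug in any one program.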
I fully expect that we will get there eventually. But it is likely to involve a number of false starts, and for some time we should all be prepared to live with the decentralized equivalent of Twitter's Fail Whale.