The team at Twilio has done an amazing job making telephony accessible to developers. Making and accepting calls, as well as sending and receiving texts, is now as easy as a couple of lines of code. The simplicity of Twilio’s API is such that most developers are literally up and running within minutes. At the same time, because the ingenious REST API makes calls and messages addressable, there is no limit to the complexity of the applications that can be created using Twilio.
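To make that concrete, here is roughly what those couple of lines look like using Twilio’s Python helper library (treat this as an illustrative sketch; the credentials and phone numbers are placeholders):

```python
from twilio.rest import Client

# Placeholder credentials; real ones come from the Twilio console.
client = Client("ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX", "your_auth_token")

# Send a text message.
client.messages.create(
    to="+15558675309",     # placeholder recipient
    from_="+15551234567",  # placeholder Twilio number
    body="Hello from Twilio!")

# Place a call; once it connects, Twilio fetches instructions (TwiML)
# from the URL, which is what makes calls addressable.
client.calls.create(
    to="+15558675309",
    from_="+15551234567",
    url="https://example.com/voice.xml")
```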
In typical Twilio fashion, you can try Twilio Client right from the product home page. So head on over there and try it out!
I have written previously that, given the huge scale of the Internet, services that might historically have been considered a feature can now be companies. There is a critical success factor, though, to achieving Internet scale that I think is being widely ignored: intentionally keeping a service “underdetermined.” What does that mean? A service is underdetermined if it allows for many use cases, including ones that could not possibly have been anticipated when the service was created. Conversely, a service is overdetermined if the creators have a very specific use case in mind and build lots of features to support exactly that use case.
The growth of an overdetermined service is heavily constrained. If the features don’t meet someone’s needs, then the more features there are, the harder it will be to “co-opt” the service for their use case. When I talk about this with entrepreneurs, I jokingly give the example of where Twitter would be today if the service had been built strictly as a way to share information about baseball games. I then go on to describe a bunch of awesome features for selecting which game, which inning, which player the information is about. It’s easy to see how you could come up with a myriad of features to support this one narrow use case. That would be an extreme example of being heavily overdetermined.
Being underdetermined is hard though. Much harder than it would appear. First, it is generally much easier to have a specific use case in mind. Second, early adopters often use a service in a particular way and tend to be power users who want more features added to the service to support their own use case. Third, if a service isn’t taking off immediately, there is a temptation to add features as a way to drive usage (“if only we had feature x, people would start to use the service”). That is a fallacy most of the time, unless x is actually critical to any use of the service.
I am not sure that there is or even can be a recipe for building a successful underdetermined service, but there are at least three lessons that I can identify:
First - question every feature. Does this feature support one highly specific use case or is this something that supports many use cases? When in doubt, leave a feature out.
Second - avoid “pollution” from use cases. If someone is using the service in novel ways, their usage shouldn’t pollute the service for others. For instance, services such as Tumblr and Twitter make that easy: if someone uses the service in a way you don’t like, you simply don’t follow them (or unfollow them).
Third - and this one should not be controversial - launch with an API or make an API available shortly after launch. This lets others add features for supporting specific use cases on top of what you are doing.
There is an important corollary to all of this for entrepreneurs. If you are starting something new and you describe your startup as “this is [like service] x for [use case] y” — make sure to ask yourself whether use case y really requires a separate service or is already sufficiently well covered by (possibly underdetermined) service x.
We don’t do very well with predicting the future because we fail to anticipate non-linearity. This is apparent almost anywhere you look. Most people, when they hear that the average temperature of the earth might rise by a couple of degrees, seem to have a mental image of a slightly warmer summer and maybe a bit less snow in winter (if you live in a place that has snow in winter). Very few people associate such a seemingly small change with the possibility of deserts in places that are lush today. Yet with non-linear systems that’s exactly what you can get from a small change.
Here are some other recent examples that I have run across. Apparently the plane that crashed in Buffalo was on auto-pilot during much of the time that ice built up on the wings. Aerodynamic lift is a perfect example of a highly non-linear system. If you lose attachment of the airflow to the wing due to excessive icing or due to too high an angle of attack, lift does not decrease gradually, it simply disappears. The easiest way to think about this is by imagining a glass on a table. As you push the glass closer to the edge of the table, the glass stays at the same “altitude.” That’s even true if part of the glass is already hanging over the edge. But push just a tiny bit further and the glass drops to the floor. Even in this seemingly trivial example it appears that we don’t have a “built-in” safety model, as kids will inevitably put plates and cups down at the edge of a table and learn not to do so only after considerable breakage.
The current financial crisis is of course another massive case of non-linearity. I have posted previously about how leverage vastly amplifies risk in a completely non-linear fashion. The overall financial system is full of feedback loops, and most of them are positive loops, meaning that effects are amplified, resulting in non-linearity. For instance, as the stock market drops, people are less wealthy. When they are less wealthy, their tolerance for risk goes down and many stocks that previously seemed acceptable now appear too risky, resulting in more selling. That of course drives the stock market down further.
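To see just how non-linear leverage is, here is a little arithmetic sketch of my own (not from the earlier post):

```python
# With leverage L, assets are L units per unit of equity and debt is
# L - 1, so after a price drop d the remaining equity is L*(1-d) - (L-1).
def equity_after_drop(leverage: float, drop: float) -> float:
    return leverage * (1 - drop) - (leverage - 1)

for lev in (1, 2, 5, 10):
    remaining = max(equity_after_drop(lev, 0.05), 0)
    print(f"{lev:>2}x leverage, 5% asset drop -> {remaining:.0%} of equity left")
```

At 10x leverage a 5% drop cuts equity in half and a 10% drop wipes it out entirely; forced selling by leveraged holders then feeds the very loop described above.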
Yet another great example is server load. We were talking to one of our portfolio companies yesterday that is experiencing rapid growth. One of the founders observed that their database server has fairly low load and even during spikes does reasonably well. He seemed to infer from that that they could handle much higher loads. But such an inference is deeply flawed and ignores the many fundamental non-linearities in server load. Let’s use a coffee shop as a simple example. As long as folks arrive at a rate that is less than the rate at which the baristas can make espressos, lattes, cappuccinos, etc., there will be no build-up of a line (for simplicity I am assuming a deterministic coffee shop, i.e. folks arrive at exactly the same interval and it takes exactly the same time to serve them). The second the arrival rate of customers exceeds the service rate, a line will start to form, and that line will continue to grow as long as the arrival rate exceeds the service rate. So a tiny change in arrival rate will result in a huge change in wait time! Just because your server can handle current spikes does not mean it won’t completely croak on spikes at only a slightly higher level.
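Here is a tiny simulation of that coffee shop (my own sketch, keeping the deterministic assumption from above):

```python
service_rate = 60  # customers the barista can serve per hour

for arrival_rate in (55, 59, 60, 61, 65):
    line = 0.0
    for hour in range(8):  # one working day
        line = max(0.0, line + arrival_rate - service_rate)
    print(f"{arrival_rate}/hr arriving -> line after 8 hours: {line:.0f} people")
```

At 59 arrivals per hour there is never a line; at 61 the line grows without bound. A change of a few percent in load produces an unbounded change in wait time.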
I have been reading a lot about education recently and how poorly it prepares us for a rapidly changing world. I was looking for examples of that from my own life and it struck me how little I learned about non-linearity in school or even in college. Given our apparent lack of built-in understanding of non-linearity this is a huge omission!
Yesterday we had a little snafu at one of the companies I work with. Some users received duplicate and even triplicate reminder emails. This snafu was caused by what I consider to be one of the key mistakes when first working on production systems: rookies trust their own code. By this I mean assuming that code you wrote yourself has the behavior you think it should have and then not guarding in other code against possible errors. The particular case in question involved two parts: code to identify certain users that should receive reminders and code to format and send those reminders. The developer had no checks in the sending code to guard against the identifying code providing erroneous information. In the worst case the sending code would have happily sent a single user thousands of reminders.
Obviously, writing and running unit tests will help, but one can’t rely on those to catch everything that might occur once the code runs in the production environment against the production data. There is always the possibility of overlooked corner cases, configuration mistakes, or operator error (if human input is involved). So how defensive should code be? Basically I think it’s almost impossible to err on the side of being too defensive. Testing whether arguments are of the right type and in the right range is always a good idea. Making sure to catch exceptions or other error conditions (unless there is some global exception handling in place) is critical to ensure your code returns and users don’t see crud. Whenever you do something hard (or impossible) to reverse, such as sending email to users (or deleting old data), ensure that you don’t do too much of it. “rm *” asks you to confirm for a reason. Your code should too.
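Here is a rough sketch of what the defensive version of the sending code might look like (the function names, cap, and mailer stub are hypothetical, not the company’s actual code):

```python
import logging

log = logging.getLogger(__name__)

MAX_REMINDERS_PER_RUN = 500  # cap on an action that can't be undone

def send_email(address, subject, body):
    """Stand-in for the real mailer (hypothetical for this sketch)."""
    print(f"sending to {address}: {subject}")

def send_reminders(user_emails):
    seen = set()  # guard against duplicates handed to us
    sent = 0
    for email in user_emails:
        # Don't trust the identifying code: validate its output.
        if not isinstance(email, str) or "@" not in email:
            log.warning("skipping invalid address: %r", email)
            continue
        if email in seen:
            log.warning("duplicate reminder suppressed for %s", email)
            continue
        if sent >= MAX_REMINDERS_PER_RUN:
            log.error("reminder cap reached; aborting for manual review")
            break
        try:
            send_email(email, "Reminder", "Time to check back in.")
        except Exception:
            log.exception("send failed for %s; continuing", email)
            continue
        seen.add(email)
        sent += 1
```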
In manufacturing, when stuff goes wrong, there tends to be physical evidence, such as a part with holes in the wrong place. It is therefore often easy to find the immediate cause of the problem, which might be that the holes were drilled in the wrong place, and to go yell at the person who does the drilling. But in Kaizen, the immediate cause of a problem is only the beginning of the analysis, not the end. The team is supposed to start with the problem and ask “Why?” seven times to determine the root cause of the problem. Why was the hole in the wrong place? Because it was drilled in the wrong place. Why was it drilled in the wrong place? Because it was marked in the wrong place. Why was it marked in the wrong place? Because the person marking it measured from the wrong location. Why did they measure from the wrong location? Because the measurement directions were unclear. And so on.
There are several wonderful things about root cause analysis. First, it immediately changes the dynamic from finger pointing to learning. The immediate cause of any error is only the last element in a chain of causes and the goal is to understand that chain. Second, by working towards the root cause a simple problem may be found to have a root cause that is in fact responsible for many other problems. This is why it is important not to ignore little problems, which in turn ties back to the first post in this series. If you set seemingly outlandish quality goals, then every little problem is worthy of analysis. That analysis — if done right — will help uncover root causes that were either already — or would in the future be — responsible for many more problems.
In software development we tend not to have any physical artifacts to go on. If something is not working on a site or service, there are a myriad of possible causes. So often identifying even the immediate cause is difficult. That results in an added tendency, when a bug is finally found, to just fix it — often with a hack — and move on (possibly after some yelling at the directly responsible person). In the Kaizen approach that represents a lot of lost opportunity for learning and lasting improvement. Using the “seven whys” takes patience but is great discipline. In fact, so much so that I recommend it not just for manufacturing or software errors, but for anything you encounter that’s not going the way it should.
One of my favorite Kaizen techniques is visualization. On the shopfloor this takes the form of large signs that graphically display key quality metrics. The charts show overall trendlines but also break out individual teams. This is a powerful motivator. When there are large gains in quality, the credit can go to the team(s) that produced the progress. Conversely, when the overall chart shows a dip or a slow down in improvement, it is often readily apparent which team is responsible. Some people may find the idea of this level of transparency uncomfortable, but much depends on how successes and failures are handled. In Kaizen, successes are celebrated by all teams and failures are seen as an opportunity for learning (more on that in a separate post). That means when a team stands out as having dragged down performance the reaction of the other teams is not a “shame on you” but a “let us help you.”
Quality visualization in a development environment is surprisingly rare. I have seen very few teams where the first thing folks see when entering the development area (and when logging onto the intranet or wiki) are charts of quality metrics. That is all the more surprising as many of these can be collected automatically (unlike on a shopfloor, where there tends to be a fair bit of manual effort). Site or service uptime and latency, number of bugs at varying levels of severity, time to close out bugs, missing or broken check-ins, unit-test results, etc. can all be gathered in an automated fashion. Breaking this out by team may take some effort, but most folks aren’t even displaying aggregates, so there is a lot of room for improvement.
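As an illustration of how little effort the automation takes, here is a minimal sketch (the health-check URL and output file are assumptions) of the kind of script that could run from cron and feed a wall chart:

```python
import csv
import time
import urllib.request
from datetime import datetime, timezone

HEALTH_URL = "https://example.com/health"  # hypothetical endpoint
LOG_FILE = "uptime_log.csv"

def sample():
    start = time.monotonic()
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            up = resp.status == 200
    except Exception:
        up = False
    latency_ms = (time.monotonic() - start) * 1000
    with open(LOG_FILE, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), up, f"{latency_ms:.1f}"])

if __name__ == "__main__":
    sample()  # run every minute from cron; chart the CSV on a wall display
```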
One mantra that is often brought up for entrepreneurs is to “fail early” (and some add “and often”). The theory is that if your business is not working, it’s better to fold before you have spent a lot of money and try something else instead from scratch. It’s not just the money: if you go too long in one direction, you build an organization that has inertia in that direction, making it difficult if not impossible to change course. I have certainly seen this happen firsthand many times, and one of my biggest investing mistakes was trying to take a consumer company in the personal finance space and help turn it into a B2B content provider.
But this mantra may no longer apply in a world of lightweight web services and cloud computing. For starters, the amount of money required to start something has declined dramatically. So has the number of people required to build and operate a service. In a world where one or a couple of people can run a meaningfully big service (e.g. plentyoffish, tumblr), maybe “fail early” needs to be replaced with “iterate early.” Get something out there. Let people bang on it. Not working? Try something slightly different (or radically different, for that matter). As long as you keep your burn low and your team small, you should have a lot of flexibility. We are beginning to see some examples of this approach, such as iminlikewithyou, which launched as a dating site and is now a game platform.
Many of the high-growth companies in our portfolio have run into scaling issues. There is a lot of information out there on various technical approaches to scaling. What most of those leave out is the interaction between the choice of architecture and organizational scaling. Some architectures lend themselves much better to organizational scaling than others. A horizontal approach with a Data Access Layer, a Business Logic Layer, and a Presentation Layer suffers from a lot of coordination overhead: to implement new functionality, the various horizontal teams all need to coordinate before anything can get done.
I am therefore a big fan of a services based architecture, which takes more of a vertical approach to dividing up systems. For instance, most web sites and services have a concept of a user profile. In a services based architecture everything having to do with user profiles might be encapsulated in one service (create a profile, retrieve a profile, etc). Organizationally it now becomes possible to have a team that’s in charge of the profile service. That team can make changes to the service implementation as long as the changes don’t break the service API. In fact, the team can even enhance the functionality of the service by adding new methods to the API. This allows for much better organizational scaling as innovation no longer requires nearly as much coordination. In addition to innovation by each service team, it’s also possible to innovate by combining the existing services in novel ways to deliver end user functionality.
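Here is a sketch of what such a profile service might look like (all names here are illustrative, not any particular company’s code):

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Profile:
    user_id: str
    display_name: str
    bio: str = ""

class ProfileService:
    """Everything about profiles lives behind this interface."""

    def __init__(self) -> None:
        self._store: Dict[str, Profile] = {}  # swap for a real DB freely

    def create_profile(self, user_id: str, display_name: str) -> Profile:
        if user_id in self._store:
            raise ValueError(f"profile already exists for {user_id}")
        profile = Profile(user_id, display_name)
        self._store[user_id] = profile
        return profile

    def get_profile(self, user_id: str) -> Optional[Profile]:
        return self._store.get(user_id)

    # New methods like this one extend the API without breaking callers.
    def set_bio(self, user_id: str, bio: str) -> None:
        self._store[user_id].bio = bio
```

The team owning this service can replace the in-memory dict with a database, add caching, or shard the data, and no other team has to coordinate as long as these methods keep their contracts.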
I have always encouraged companies to spend on great chairs, keyboards and multiple monitors for developers. This was based on the conventional wisdom (among developers) that more screen real estate is better.
Recently, NEC sponsored a study, conducted at the University of Utah, that finds actual productivity benefits from using multiple displays. Ars Technica published a good overview and you can find a detailed summary directly from NEC. Interestingly, the study shows that there are diminishing returns to screen real estate and also that in some cases having one slightly larger screen (20”) is better than having two smaller screens (18”) combined. The basic approach of the study — randomized assignment to different sequences of display size and random assignment to text editing and spreadsheet tasks — seems fine.
It would be nice to see this study repeated with coding tasks. I have a nagging suspicion that large screens and multiple screens have some significant negative side effects on code quality. In particular, I suspect that folks tend to write much longer code blocks when they have larger screens (both longer lines and more lines per function/method), which almost always translates into code that’s harder to understand and maintain. I have no hard evidence for that, but these days the only time I have to play around with development is on long flights on my laptop, which enforces a nice discipline.
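If someone wanted to test that suspicion, even a crude measurement would be a start. Here is a throwaway sketch of my own that computes line-length statistics for a codebase; comparing the numbers across developers with different screen setups would be the interesting part:

```python
import pathlib
import statistics

def line_length_stats(root: str = ".") -> None:
    lengths = [
        len(line)
        for path in pathlib.Path(root).rglob("*.py")
        for line in path.read_text(errors="ignore").splitlines()
        if line.strip()  # ignore blank lines
    ]
    if not lengths:
        print("no Python lines found")
        return
    over_100 = sum(length > 100 for length in lengths) / len(lengths)
    print(f"lines: {len(lengths)}, median length: {statistics.median(lengths)}, "
          f"share over 100 chars: {over_100:.1%}")

line_length_stats()
```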