We are continuing along with the web request cycle. Last week we took a look at the HTTP protocol. There I already mentioned that HTTP requests and responses travel over a TCP/IP connection. Today we will dive a bit deeper into TCP/IP. Strictly speaking this is not necessary for understanding the request cycle, because these lower levels of the network are completely abstracted away when you develop for the web (which is a fancy way of saying you get to use them without worrying about how they work). Yet peeling the onion a bit further will turn out to be very useful for an overall understanding of how things work on the web.
In the Tech Tuesday on networking, I introduced the idea that the Internet is a packet switched network. As a refresher, this means that data gets cut up into packets. The IP layer is responsible for how these packets move across the network. What follows is a bit of a simplification, but good enough for our purposes here. Each packet (sometimes also referred to as a datagram) has its own header which contains, among other things, the source and destination IP addresses. These packets travel between machines along flexible paths known as routes. There is a tool called traceroute for examining what these routes are and it is worth trying out.
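To make the idea of a packet header concrete, here is a small Python sketch that unpacks the fixed 20-byte portion of an IPv4 header and pulls out the source and destination addresses. The field layout follows the IPv4 specification; the addresses in the hand-crafted example packet are made up for illustration.

```python
import socket
import struct

def parse_ipv4_header(data: bytes) -> dict:
    """Parse the fixed 20-byte portion of an IPv4 header."""
    (version_ihl, tos, total_length, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", data[:20])
    return {
        "version": version_ihl >> 4,
        "ttl": ttl,                          # decremented at every hop
        "protocol": proto,                   # 6 = TCP, 17 = UDP
        "source": socket.inet_ntoa(src),
        "destination": socket.inet_ntoa(dst),
    }

# A hand-crafted example header: version 4, TTL 64, protocol TCP,
# from the (made-up) address 192.168.1.10 to 93.184.216.34.
header = struct.pack("!BBHHHBBH4s4s",
                     (4 << 4) | 5, 0, 40, 1, 0, 64, 6, 0,
                     socket.inet_aton("192.168.1.10"),
                     socket.inet_aton("93.184.216.34"))
print(parse_ipv4_header(header))
```

Note that every router along the route only needs this header to decide where to send the packet next; it never looks at the payload.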
On a Mac, use Spotlight to find and start the “Terminal” application. You will get a new window with a prompt which lets you type commands (this is known as the command line and we will learn a lot more about it in a future Tech Tuesday). Type “traceroute google.com” and you will see output that looks something like the following:
Each one of these lines represents a so-called “hop” – meaning packets traveling between two internet devices. The first hop is from my computer to my home switch. The second hop is from there to my home VPN device, which is connected to a cable modem from Time Warner. From there the packets travel over a whole bunch more intermediate switches and routers until they get to a server operated by Google. You can try this with other servers as well, such as “traceroute www.dailylit.com” – if the output gets stuck with lines containing just “* * *” instead of information on hops, you can terminate the process by pressing Ctrl-C. For those of you on Windows, here is how to run a traceroute.
Now the really important part to keep in mind about the IP level of the protocol is that it is strictly best effort. This means that packets can travel different routes, can get dropped and can arrive out of order at the destination. So how in the world do we get an HTTP request and response across such a fundamentally unreliable network? Well, that’s where the TCP portion comes in. TCP, the Transmission Control Protocol, sits on top of IP and provides guaranteed in-order delivery of packets. How does it do that? The details are complicated, but for our purposes it is sufficient to understand that it starts with a fair bit of initial “handshaking” (back and forth) in which the two endpoints (sender and receiver) agree on what they will do. Once that “connection” has been established, it becomes possible to keep track of which packets have been received and which have not, and to resend packets that may have been dropped.
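You can see this division of labor from a program's point of view with a few lines of Python. The sketch below runs a tiny echo server on localhost so it is self-contained; the point is that the call to connect performs the TCP handshake for you, and afterwards the operating system hands your application a reliable, in-order byte stream – all the retransmission bookkeeping happens below the surface.

```python
import socket
import threading

# A tiny echo server on localhost so the example is self-contained.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
server.listen(1)
host, port = server.getsockname()

def echo_once():
    conn, _ = server.accept()        # handshake completes here on the server side
    with conn:
        conn.sendall(conn.recv(1024))

threading.Thread(target=echo_once, daemon=True).start()

# create_connection() triggers the SYN / SYN-ACK / ACK handshake.
with socket.create_connection((host, port)) as client:
    client.sendall(b"hello over TCP")
    reply = client.recv(1024)

print(reply)   # the bytes come back intact and in order
server.close()
```

This is exactly the facility HTTP builds on: the browser opens a TCP connection like this and then simply writes the request text into it.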
What are some of the takeaways here? First, having fewer hops makes things faster. If you try different servers with traceroute, you will see that a lot of servers are more hops away than Google’s – Google has invested heavily in shortening the paths to their servers. This is also what so-called CDNs, or Content Delivery Networks, do: they bring content (e.g., images) closer to the “edge” of the network so that requests have fewer hops. Second, setting up a TCP connection involves a fair bit of overhead. In the first version of HTTP each request required a new connection, which was very inefficient. With HTTP 1.1 a single connection is kept alive for a sequence of requests and responses (a session). But a separate connection is still required for each different server, so a web page that pulls in resources from many different servers incurs more overhead. Third, if you really want a lot of speed it helps to reduce the number of packets that need to be sent. In the early days, the entire home page of Google was optimized to fit into a single packet.
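The keep-alive point can be demonstrated with Python's standard library. The sketch below (the local server and trivial handler are just for illustration) sends two HTTP requests over a single HTTPConnection, so the TCP handshake cost is paid only once:

```python
import http.client
import http.server
import threading

# Serve a trivial page locally so the example needs no network access.
class Handler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"        # enables keep-alive between requests
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):        # silence per-request logging
        pass

httpd = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=httpd.serve_forever, daemon=True).start()

# One TCP connection, two HTTP request/response pairs on it.
conn = http.client.HTTPConnection("127.0.0.1", httpd.server_address[1])
for _ in range(2):
    conn.request("GET", "/")
    resp = conn.getresponse()
    body = resp.read()                   # must drain the response before reusing
    print(resp.status, body)
conn.close()
httpd.shutdown()
```

With HTTP/1.0 the equivalent would have torn down and rebuilt the connection between the two requests, paying the handshake overhead twice.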