Today we are continuing with the web cycle that I outlined two weeks ago. After a URL has been parsed in Step 1, the browser needs to determine the IP address for the domain as Step 2. Reprising the previous example, let’s consider the domain name dailylit.com. How does the browser determine that in order to retrieve information from this domain it should access a server at IP address 72.32.133.224? This is accomplished via a system called DNS, which stands for Domain Name System and is essentially the equivalent of a telephone book: it provides IP addresses (telephone numbers) for domain names (people’s names).
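For the programmers among you, here is a minimal sketch of that lookup in Python, using nothing but the standard library to ask your operating system’s resolver (the address you get back today may well differ from the one above):

```python
# Resolve a domain name to an IP address via the OS resolver.
import socket

ip = socket.gethostbyname("dailylit.com")
print(ip)  # e.g. 72.32.133.224 at the time of writing
```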
On the ARPANET, the predecessor to the Internet, there were so few hosts that this telephone book was simply a file called HOSTS.TXT, which was retrieved from a computer at SRI and stored locally. There were only a few hosts (mostly at universities) and the file was relatively short. Today on the Internet there are over 200 million domain names of the type dailylit.com, which are further subdivided through subdomains such as blog.dailylit.com. So the idea of having every computer maintain a complete and up-to-date copy of the telephone book locally no longer makes sense.
Thankfully, in the early 1980s, which depending on your perspective is either ancient pre-history or not that long ago, DNS was born as a service that would allow the registration of domain names and maintain a mapping between the names and IP addresses in a robust fashion. In fact, without DNS it would be hard to imagine the Internet having grown as dramatically, and we probably wouldn’t have nearly as many domains to begin with.
There are many ingenious ideas in the design of DNS and I won’t be able to cover them all here. Instead, I will focus on some key concepts. The first and central one is that there is a hierarchy of authority which allows for the delegation of both the registration of domain names and the lookup of IP addresses. The hierarchy starts with the 13 root servers, which together serve the so-called root zone from which all authority flows. It is here that the so-called Top Level Domains or TLDs get resolved. Going back to blog.dailylit.com, the TLD is the “.com” part. You can think of a domain name as a set of nested Russian dolls, where the outermost doll, the TLD, is the rightmost part of the name.
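To make the nesting concrete, here is a toy Python snippet that peels the dolls apart from the right; this is nothing more than an illustration of the hierarchy:

```python
# Decompose a domain name into its nested zones, outermost (TLD) first.
def zones(domain):
    labels = domain.split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]

print(zones("blog.dailylit.com"))
# ['com', 'dailylit.com', 'blog.dailylit.com']
```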
The most common TLDs are .com and .net, which together account for about half of all domain names. There is of course also .org, .gov, .edu and an ever increasing number of other TLDs, such as most recently .xxx. And then there are TLDs for countries, which all consist of two letters, such as .uk for the UK (duh) or .ly for Libya, popularized by bit.ly, and .us for the US, which made the domain del.icio.us possible. Each TLD has one or more registrars associated with it that are in charge of letting people and companies reserve names in that domain.
The root servers point to name servers for each of these TLDs. Since blog.dailylit.com is in the .com domain, the next place to look is the .com name servers. The .com name servers in turn point to the name servers for dailylit.com itself, which are currently at Rackspace. Since Susan and I registered and control dailylit.com, we are the ones who get to decide which nameservers should be queried to find the IP address for dailylit.com and its subdomains, such as blog.dailylit.com. Generally this happens by logging into a system run by a registrar and setting which nameservers are to be the authoritative sources of IP addresses for the dailylit.com domain. That then gets recorded in the nameservers for the corresponding TLD.
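If you want to see this delegation for yourself, here is a short sketch using the third-party dnspython library (a pip install dnspython away); the nameservers it prints for dailylit.com may of course have changed by the time you run it:

```python
# Ask which nameservers are authoritative for dailylit.com,
# using the third-party dnspython library.
import dns.resolver

for record in dns.resolver.resolve("dailylit.com", "NS"):
    print(record.target)  # the authoritative nameservers, e.g. at Rackspace
```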
The lookup process that started at the root and went to the .com TLD has now arrived at the dailylit.com nameservers at Rackspace. They in turn contain the information on dailylit.com itself and its subdomains, such as blog.dailylit.com. The whole process of starting at the root and working towards the subdomain (right to left) in a series of separate lookups across different servers is called a “recursive lookup” (strictly speaking, the individual queries are iterative, but the server that does this work on your behalf is known as a recursive resolver). If this sounds complicated to you, that’s because it is. It is so complicated and resource intensive that we don’t want the web browser to have to do this each time it encounters a domain name. It would not only be slow, but it would also swamp the root servers, the TLD servers and possibly even the name servers for dailylit itself.
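For the curious, here is a simplified sketch of that walk using dnspython’s low-level interface. The starting address below is a.root-servers.net, one of the 13 roots; a real resolver handles many cases this toy version ignores (CNAME answers, truncated responses, referrals that arrive without glue addresses):

```python
# A simplified recursive lookup: start at a root server and follow
# referrals downward until an authoritative answer appears.
import dns.message
import dns.query
import dns.rdatatype

def walk(name, server="198.41.0.4"):  # a.root-servers.net
    query = dns.message.make_query(name, dns.rdatatype.A)
    response = dns.query.udp(query, server, timeout=5)
    if response.answer:  # an authoritative server answered directly
        return response.answer[0][0].address
    # Otherwise this was a referral; follow the first nameserver
    # for which the response included a glue A record.
    for rrset in response.additional:
        if rrset.rdtype == dns.rdatatype.A:
            return walk(name, rrset[0].address)

print(walk("dailylit.com"))  # root -> .com -> dailylit.com's nameservers
```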
So instead of doing a recursive lookup every time, the results of these lookups are stored on so-called DNS cache servers. For instance, most ISPs through which you access the Internet operate their own cache servers. After they have looked up blog.dailylit.com once, these servers will “cache” (meaning temporarily store) the result of the lookup, thus providing a much faster lookup the next time. In fact, your own computer will often cache the results of lookups locally for super fast access. This matters because even a single web page generally involves multiple requests (e.g. for images) to the same server. The duration for which the results of a recursive lookup can be cached is known as the Time To Live or TTL and is controlled by the owner of the domain (and generally honored by the cache servers).
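Here is a toy version of such a cache in Python. The lookup argument is a hypothetical stand-in for the full recursive lookup, and a real cache server would take the TTL from the DNS response itself (where the domain owner set it) rather than use a fixed default:

```python
# A toy TTL-respecting DNS cache: answer from memory while an entry
# is still fresh, re-resolve once it has expired.
import time

cache = {}  # domain -> (ip, expiry timestamp)

def cached_lookup(domain, lookup, ttl=300):
    entry = cache.get(domain)
    if entry and entry[1] > time.time():
        return entry[0]  # still fresh: no network traffic at all
    ip = lookup(domain)  # expired or never seen: do the expensive lookup
    cache[domain] = (ip, time.time() + ttl)
    return ip
```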
The existence of cache servers (sometimes also referred to as non-authoritative servers, although the two are technically not exactly the same) introduces a critical security vulnerability into DNS. Let’s say you have gone to your favorite coffee shop and logged on to the WiFi network there. Where do your domain lookups go? Well, to the cache servers of whatever ISP the coffee shop uses, or possibly even to cache servers on the coffee shop’s own network. An attacker with access to those local cache servers could insert falsified records that, say, point chase.com to some rogue server that wants to steal your bank username and password. This would allow for a so-called man-in-the-middle attack (more on this in a future post). Fortunately, a set of security additions to DNS known as DNSSEC will in the future prevent these kinds of attacks. As more and more of our access to the Internet happens over wireless networks this becomes particularly important.
If you made it this far, I hope you have a (newfound) appreciation for the complexity of a system that is used billions of times per day behind the scenes of nearly every access to the Internet. In addition to the technical issues there are also important political issues surrounding DNS. Most recently, the proposed SOPA and PIPA legislation would have required nameserver operators to make changes that would have interfered with the implementation of DNSSEC. Then there is also the question as to who really controls the root zone, which turns out to be the US Department of Commerce. Yes, for the *entire* Internet, which is all the more reason why we should make DNS better, not worse.