Tuesday, February 7, 2012
Tech Tuesday: Routing and TCP/IP
We are continuing along with the web request cycle. Last week we took a look at the HTTP protocol. There I already mentioned that HTTP requests and responses travel over a TCP/IP connection. Today we will dive a bit deeper into TCP/IP. This is technically not really necessary for understanding the request cycle because these lower levels of the network are completely abstracted away when you develop for the web (which is a fancy way of saying you get to use it without worrying about how it works). Yet, peeling the onion a bit further will turn out to be very useful to the overall understanding of how things work on the web.
In the Tech Tuesday on networking, I introduced the idea that the Internet is a packet switched network. As a refresher this means that data gets cut up into packets. The IP layer is responsible for how these packets move across the network. What follows is quite a bit of a simplification but good enough for our purposes here. Each packet (sometimes also referred to as a datagram) has its own header which contains among other things the source and destination IP addresses. These packets travel between machines along flexible paths known as routes. There is a tool called traceroute for examining what these routes are and it is worth trying this out.
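Before trying traceroute, it may help to see what "a header with source and destination addresses" looks like in practice. The following is a minimal Python sketch that decodes the fixed 20-byte part of an IPv4 header. The sample header is built by hand, the source address 192.168.1.2 is a made-up home network address, and the checksum is left at zero, so treat it as an illustration rather than a real captured packet.

```python
import socket
import struct

def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header."""
    (version_ihl, _tos, total_length, _ident, _flags_frag,
     ttl, protocol, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": version_ihl >> 4,                # upper 4 bits
        "header_bytes": (version_ihl & 0x0F) * 4,   # lower 4 bits, in 32-bit words
        "total_length": total_length,
        "ttl": ttl,
        "protocol": protocol,                       # 6 means TCP
        "source": socket.inet_ntoa(src),
        "destination": socket.inet_ntoa(dst),
    }

# Hand-built sample header (hypothetical source address, checksum left at 0).
sample_header = struct.pack(
    "!BBHHHBBH4s4s",
    (4 << 4) | 5, 0, 40, 0, 0, 64, 6, 0,
    socket.inet_aton("192.168.1.2"),
    socket.inet_aton("74.125.115.103"),
)
header = parse_ipv4_header(sample_header)
print(header["source"], "->", header["destination"])
```

Every router along the way only needs to look at the destination field to decide where to send the packet next.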
On a Mac, use Spotlight to find and start the “Terminal” application. You will get a new window with a prompt which lets you type commands (this is known as the command line and we will learn a lot more about it in a future Tech Tuesday). Type “traceroute google.com” and you will see output that looks something like the following:
1 192.168.1.1 (192.168.1.1) 1.987 ms 0.864 ms 0.794 ms
2 10.32.128.1 (10.32.128.1) 9.576 ms 8.510 ms 7.638 ms
3 gig-0-3-0-7-nycmnya-rtr2.nyc.rr.com (24.29.97.130) 7.983 ms 8.371 ms 8.123 ms
4 tenge-0-5-0-0-nycmnytg-rtr001.nyc.rr.com (24.29.150.90) 12.007 ms 12.481 ms nycmnytg-10g-0-0-0.nyc.rr.com (24.29.148.29) 14.716 ms
5 bun6-nycmnytg-rtr002.nyc.rr.com (24.29.148.250) 18.132 ms 11.899 ms 12.706 ms
6 ae-4-0.cr0.nyc30.tbone.rr.com (66.109.6.78) 7.120 ms 8.395 ms 8.113 ms
7 ae-4-0.cr0.dca20.tbone.rr.com (66.109.6.28) 13.161 ms 66.109.9.30 (66.109.9.30) 14.679 ms ae-4-0.cr0.dca20.tbone.rr.com (66.109.6.28) 13.992 ms
8 107.14.19.135 (107.14.19.135) 14.153 ms 12.694 ms ae-1-0.pr0.dca10.tbone.rr.com (66.109.6.165) 14.154 ms
9 66.109.9.66 (66.109.9.66) 15.230 ms 74.125.49.181 (74.125.49.181) 13.553 ms 66.109.9.66 (66.109.9.66) 13.315 ms
10 209.85.252.46 (209.85.252.46) 17.017 ms 14.467 ms 209.85.252.80 (209.85.252.80) 15.536 ms
11 209.85.243.114 (209.85.243.114) 26.926 ms 209.85.241.222 (209.85.241.222) 25.348 ms 25.406 ms
12 216.239.48.103 (216.239.48.103) 25.799 ms 64.233.174.87 (64.233.174.87) 25.046 ms 216.239.48.103 (216.239.48.103) 32.101 ms
13 * 209.85.242.177 (209.85.242.177) 40.436 ms *
14 vx-in-f103.1e100.net (74.125.115.103) 25.568 ms 26.283 ms 26.659 ms
Each one of these lines represents a so-called “hop”, meaning packets traveling between two internet devices. The first hop is from my computer to my home switch. The second hop is from there to my home VPN device which is connected to a cable modem from Time Warner. From there the packets travel over a whole bunch more intermediate switches and routers until they get to a server operated by Google. You can try this with other servers as well, such as “traceroute www.dailylit.com”. If the output gets stuck with lines containing just “* * *” instead of information on hops, then you can terminate the process by pressing Ctrl-C. For those of you on Windows, here is how to run a traceroute.
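If you want to work with this output rather than just read it, each hop line can be pulled apart with a few lines of Python. This is only a sketch that assumes the output format shown above (hop number first, round-trip times followed by "ms"); other traceroute versions may print things differently.

```python
import re

def parse_hop(line: str):
    """Return (hop_number, [round-trip times in ms]) for one traceroute line."""
    hop_number = int(line.split()[0])
    # Only numbers directly followed by " ms" are timings; IP addresses are skipped.
    times = [float(t) for t in re.findall(r"(\d+\.\d+) ms", line)]
    return hop_number, times

first_hop = "1 192.168.1.1 (192.168.1.1) 1.987 ms 0.864 ms 0.794 ms"
lossy_hop = "13 * 209.85.242.177 (209.85.242.177) 40.436 ms *"

print(parse_hop(first_hop))
print(parse_hop(lossy_hop))
```

The second sample line is hop 13 from the output above, where two of the three probes went unanswered (the "*" entries) and only one timing was recorded.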
Now the really important part to keep in mind about the IP level of the protocol is that it is strictly best effort. This means that packets can travel different routes, can get dropped and can arrive out of order at the destination. So how in the world do we get an HTTP request and response across such a fundamentally unreliable network? Well, that’s where the TCP portion comes in. TCP, the Transmission Control Protocol, sits on top of IP and provides reliable, in-order delivery of the data those packets carry. How does it do that? Well, the details are complicated, but for our purposes it is sufficient to understand that it starts with a fair bit of initial “handshaking” (back and forth) where the two endpoints (sender and receiver) agree on what they will do. Once that “connection” has been established it becomes possible to keep track of which packets have been received and which have not and to cause packets that might have been dropped to be resent.
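The bookkeeping idea can be sketched in a few lines of Python. This is a toy model, not how TCP is actually implemented: real TCP numbers bytes rather than whole segments and uses acknowledgements and timers, all of which are left out here. The function name and the sample data are made up for illustration.

```python
def receive_in_order(arrivals, total_segments):
    """Toy receiver: buffer out-of-order segments and deliver them in sequence.

    `arrivals` is a list of (sequence_number, payload) pairs in arrival order.
    Returns the payloads delivered so far and the sequence numbers still missing.
    """
    buffered = {}
    delivered = []
    next_expected = 0
    for seq, payload in arrivals:
        if seq >= next_expected:
            buffered.setdefault(seq, payload)  # a duplicate copy is ignored
        # Hand over every segment that is now contiguous with what was delivered.
        while next_expected in buffered:
            delivered.append(buffered.pop(next_expected))
            next_expected += 1
    missing = [s for s in range(next_expected, total_segments) if s not in buffered]
    return delivered, missing

# Segment 1 is dropped by the network and the others arrive out of order.
arrivals = [(2, "ex.h"), (0, "GET "), (3, "tml")]
delivered, missing = receive_in_order(arrivals, total_segments=4)
print(delivered, missing)

# Once the sender resends the missing segment, everything can be delivered.
delivered_after_resend, still_missing = receive_in_order(arrivals + [(1, "/ind")], 4)
print("".join(delivered_after_resend))
```

The receiver holds on to segments 2 and 3 until segment 1 shows up, which is why the application on top only ever sees the data in the right order.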
What are some of the takeaways here? First, having fewer hops will tend to make things faster. If you try different servers with traceroute, you will see that a lot of servers are more hops away than Google’s. Google has invested heavily in shortening the paths to their servers. This is also what so-called CDNs or Content Delivery Networks do. They bring content (e.g., images) closer to the “edge” of the network so that requests have fewer hops. Second, setting up a TCP connection involves a fair bit of overhead. In the first version of HTTP each request required a new connection which was very inefficient. With HTTP 1.1 a single connection is kept alive for a sequence of requests and responses (a session). But there is still a separate connection required for each different server and so a web page that connects to many different resources incurs more overhead. Third, if you really want a lot of speed it helps to reduce the number of packets that need to be sent. In the early days, the entire home page of Google was optimized to fit into a single packet.
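To get a feel for the third point, here is a back-of-the-envelope calculation in Python. It assumes a maximum payload of 1460 bytes per TCP segment, which is the usual figure on Ethernet (a 1500 byte packet minus 20 bytes each for the IP and TCP headers). Actual values vary by network, and the page sizes are made-up examples.

```python
import math

def packets_needed(payload_bytes: int, max_segment_bytes: int = 1460) -> int:
    """Number of TCP segments needed to carry a payload of the given size."""
    return math.ceil(payload_bytes / max_segment_bytes)

for size in (1400, 1461, 100_000):
    print(size, "bytes ->", packets_needed(size), "packet(s)")
```

A page that just fits under the limit travels in one packet, while a single extra byte pushes it to two.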
Tags: tech_tuesday web networking
Friday, February 3, 2012
Learning to Love Sales
I started out in business selling development services. OK, so that’s hugely glorified. I was a teenager desperate to get my driver’s license in Germany where that is a costly process (lots of mandatory lessons). So I figured out how to make money programming custom applications for people, including a driving school. Ever since then I have had a love/hate relationship with sales.
The hate portion is easy to understand for any engineer. The things you have to do in selling run counter to a lot of things you care about as an engineer. For instance, you need to spend a lot of time explaining stuff to people that should be, well, obvious. And most importantly, time spent selling is not time spent creating. The same reasons for hating sales seem to apply to product and design folks.
But I also love sales and not just because it helped pay for my driver’s license. People paying for your product is what enables you to grow your business without giving up (more) equity. And selling is what provides critical feedback about what you should build, making your product better. If you have the right kind of product or service (one with network effects), then selling has the additional benefit of making the product/service better for everyone. Finally, selling is about educating users who otherwise wouldn’t know how or why to use the product or service.
Unfortunately, I see all too many product and/or engineering led organizations that are in thrall to their hate or disdain or at a minimum personal dislike for sales (and by extension sales people). That’s highly unfortunate because it dramatically reduces the overall chances of success for these organizations. And one of the grand ironies is that many of these organizations think that they are in some way emulating Google, which has somehow managed to create a myth that Google got big without sales (nothing could be further from the truth). A similar myth seems to be in the making about Facebook.
Incidentally, I don’t mean sales here just as in having sales people. I also mean selling as in convincing endusers explicitly of the value of a product (ok, so technically that’s marketing but it has many of the same aspects). I haven’t figured out yet how to help people to learn to love sales. But maybe a starting point will be to point at how important selling has been to the success of companies such as Google. And of course selling has been critical to the company currently so beloved by many engineers, designers and product folks alike: Apple.
Tags: selling strategy organization
Thursday, February 2, 2012
Facebook’s Valuation
I was going to write a post about Facebook’s valuation, but Bill Gurley has done such an excellent job, that the better idea is to point at his post explaining “Why Facebook Clearly Belongs in the 10x Revenue Club.” There is one other important point to consider in thinking about Internet company valuations in the current economic environment: low interest rates. Companies that are still growing and have a lot of room for future growth have a fair bit of their value sitting in the future — that’s certainly true for a company such as Facebook. As we are currently in a global deflationary environment (*) the discount rate being applied to these future cash flows is lower than it has been in a very long time and possibly ever. When I started learning about DCF models in the late 80s, a common rule of thumb was to use 7% for the risk free rate of return!
The (*) above is to indicate that we have been expanding the money supply like never before, but the lending multiplier has contracted even faster and supply is far outstripping demand, which in combination results in a deflationary environment.
PS If anyone has seen a good analysis of revenue composition for Facebook (advertising versus credits, assuming that’s even disclosed in the S1) please let me know.
Tags: facebook valuation ipo interest_rates
Wednesday, February 1, 2012
The Challenge: A Decentralized Rights Registry
Since my post last Friday about alternatives to SOPA/PIPA, I have started to talk to a bunch of people about the idea of a decentralized content registry. Here are some thoughts and questions that I have been kicking around since then.
A decentralized system could offer a new revenue stream for existing registrars and certificate authorities who are already at least partially equipped to deal with issues of verification. By having a competitive situation from the start the price for content registration can be determined by the market rather than set by the government.
We should figure out how to leverage DNS and DNSSEC in this context. The direction in which I am thinking is some kind of analogy to DKIM. Part of what this would require is an efficient way to create signatures even for large pieces of content, such as a feature length movie.
The scheme should probably specifically *not* support DRM and the idea of windowing either by time or geography. Why? Because those are exactly the types of artificial scarcity that piracy exploits. Instead all the content in the registry should be unencumbered and even externally cacheable.
Here then are some of the big questions to consider:
1. If the registry allows content owners to set a price (and likely a separate price for download versus streaming), then how does that money get remitted to the content owners? And how/where does usage get metered?
2. What is an efficient method for discovering copies that are not participating in the scheme? One idea here would be that content that participates in the registry gets fingerprinted. That would enable third parties to build services that report (presumably unsigned or mis-signed) copies that match the fingerprints. To that end it might be worthwhile considering a NIST competition for a publicly available content fingerprint technology (similar to NIST’s hash function competition).
I would love to hear from people who have thought about content rights registries before and/or work with some of the existing centralized ones. For this to work at Internet scale whatever solution comes out cannot be centralized.
Tags: copyright registry dns
Tuesday, January 31, 2012
Tech Tuesday: HTTP
Today we are continuing on with the web request cycle. After the browser has parsed the URL and obtained the IP address of the server via DNS, the browser now has to communicate with the server. That is done using the so-called Hypertext Transfer Protocol or HTTP for short. The beginnings of HTTP go back to the early 1990s when Tim Berners-Lee first devised it drawing inspiration from Ted Nelson, who had coined the term Hypertext in 1963. For an even earlier description of a similar idea it is worth reading Vannevar Bush’s amazing “As We May Think” from 1945!
HTTP builds on top of the lower level Internet protocol TCP which permits establishing a connection between two machines (see my introduction to networking). A so-called HTTP session consists of a series of requests from the browser followed by responses from the server. Each request consists of a request method, a resource (URL), a set of headers and optionally a request body.
The most common HTTP request methods are verbs such as GET, POST, PUT and DELETE (I am capitalizing them because that’s how they appear in the protocol). What’s great about these is that they are wonderfully descriptive of what you expect the request to do. GET is supposed to, well, get information from the resource. I say resource rather than server because that is the right level to think about with regard to HTTP: it is about manipulating abstract resources rather than servers. PUT on the other hand puts information at the resource (without regard to what’s already there). DELETE (you get the idea) deletes the information at the resource. This relative obviousness and some associated expectations around how these methods behave provide a powerful foundation for the transfer of information (more on that in a future post on so-called RESTful APIs).
The headers contain additional information about the request. For instance, the “Date” header field contains the date and time when the request was sent. Or the “Referer” header (misspelled in the protocol and in most implementations!) contains the URL of the page on which the currently requested resource was found. It is worth looking at the list of possible HTTP headers, which also shows the headers for a response (see below). It should be pointed out that the HTTP protocol allows for the creation of additional headers which can carry custom information (not always what you would want as in the recent case of O2 sending users’ phone numbers!).
The request body is used for POST and PUT requests to carry the data. For instance when you encounter a registration form on the web that asks for your name and email address, the information you type into the form fields is (generally) carried in the body of the resulting HTTP request.
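To make the four parts of a request concrete, here is a small Python sketch that assembles the bytes of a request by hand. The "/register" path and the form values are hypothetical, and in real code you would let a library build the request for you. The point is only to show where the method, resource, headers and body end up.

```python
from urllib.parse import urlencode

def build_request(method, host, path, headers=None, body=b""):
    """Assemble the bytes of a simple HTTP/1.1 request."""
    all_headers = {"Host": host}
    all_headers.update(headers or {})
    if body:
        all_headers["Content-Length"] = str(len(body))
    lines = [f"{method} {path} HTTP/1.1"]                    # method and resource
    lines += [f"{name}: {value}" for name, value in all_headers.items()]
    head = "\r\n".join(lines) + "\r\n\r\n"                   # blank line ends the headers
    return head.encode("ascii") + body

# Hypothetical registration form with a name and an email address.
form_body = urlencode({"name": "Ada", "email": "ada@example.com"}).encode("ascii")
request = build_request(
    "POST", "www.dailylit.com", "/register",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    body=form_body,
)
print(request.decode("ascii"))
```

Everything before the blank line is the request line and the headers. The form data that was typed into the fields follows after it as the body.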
Once the server has received and processed the request it will send an HTTP response. The response has a structure that’s quite similar to the request. Instead of the method, the server returns a status code, then some headers, and finally a response body.
The status code indicates what happened at the server and hence what to expect in the body of the response. The standard code is “200 OK” which means the server processed the request and everything went well. There are more precise responses in that vein, such as “201 Created” which means the server created a new resource (e.g. in response to a PUT request). There are a series of codes to deal with resources that have moved, such as “301 Moved Permanently” which provides a new URL that should be used for all future requests for this resource. And there are a bunch of codes to indicate various error situations such as the famous “404 Not Found” for which some web sites return very funny contents in the body of the response. Again, it’s worth browsing the complete list of response codes.
The response headers contain a lot of additional information about the response. For instance the “Content-Type” header field describes what kind of content the response body contains. Examples of different values for this header field are “text/html; charset=utf-8” for a web page in HTML using the UTF-8 character set or “image/jpeg” for an image that is compressed using JPEG. Without knowing this the browser would have to infer the content type from inspecting the body of the response which would be very cumbersome. There are a ton more headers in a response that are similarly critical to the proper functioning of the HTTP protocol, such as how long a recipient can “cache” (locally store) the body (in order to help speed up a subsequent access and also relieve the server and network).
Finally there is the body of the response which contains the actual information. The body is a bunch of bytes. What they represent can vary wildly as explained above. It could be an HTML web page or an image or something altogether different. One of the great powers of the HTTP protocol is that it is really content agnostic.
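The three parts of a response can be picked apart in much the same way. The following Python sketch parses a hand-written sample response. It assumes a complete, uncompressed response held in memory and ignores many details that real HTTP libraries handle, such as repeated headers or chunked bodies.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into status code, reason, headers and body."""
    head, _, body = raw.partition(b"\r\n\r\n")        # blank line separates head and body
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return int(code), reason, headers, body

# Hand-written sample response, not captured from a real server.
sample_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 12\r\n"
    b"\r\n"
    b"<p>Hello</p>"
)
code, reason, headers, body = parse_response(sample_response)
print(code, reason, headers["content-type"], body)
```

Note that the body stays a plain sequence of bytes. It is the Content-Type header that tells the browser how to interpret them.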
There is a lot going on with the HTTP protocol under the hood and much of it matters. So it is a bit of a shame that many people, including active developers, don’t really understand it. As a result they either create things that don’t work as expected (e.g. making resources change in response to a GET request) or re-invent features on top of HTTP that HTTP already contains (e.g., content caching). If you do any work on the web it is well worth digging deeper than this post!
Tags: tech_tuesday web http
Monday, January 30, 2012
ReRAM: An Exciting Hardware Innovation
In my Tech Tuesday posts I have covered main memory and storage (by the way, coming up tomorrow: HTTP). If you have read those or otherwise follow hardware, then you will find this short piece from BBC Technology News on a new technology known as ReRAM quite interesting. Essentially, ReRAM holds the promise of providing non-volatile storage at the speed of memory.
That would provide a major breakthrough for database applications. Not only is ReRAM even faster than the Flash memory used in the SSDs (which are currently replacing traditional disk drives for high end database applications) but it obviates the need for going to disk in the first place. That means the whole intermediate software layer that controls disk access falls away as well.
What is amazing is not just that this is possible at all, but also the history of this technology. ReRAM is based on something called a Memristor which was invented as a theoretical possibility in 1971 (long after the Transistor which was invented in 1947). Then it took until 2008 to build a Memristor. From that breakthrough by HP it seems that commercial products may be available as early as next year!
Tags: hardware innovation
Friday, January 27, 2012
Thinking About Alternatives to SOPA/PIPA
With SOPA and PIPA shelved at least for the moment, it is time to start thinking about alternatives. It would be a shame if we limited our collective thinking here to slightly different versions of those bills instead of exploring what a different approach to copyright could be that doesn’t try to fight the characteristics of the Internet but rather embraces them, providing value for rights creators/holders, technology companies and endusers.
One interesting entry here is the proposal by Ian Rogers (from Topspin Media) for a rights and media registry. It’s worth reading the entire post and also the comments, which include good questions from Andy and clarifying answers from Ian. In essence such a registry would enable tech companies to deliver innovative user experiences on top of content, as long as they respect the prices set by the rights holders. Rights holders would be entitled to enforcement only if they participate in the registry.
I believe this direction is very promising and is also something that was recommended by a report that the UK government’s copyright office had commissioned. An important addition though would be that this should not be a centralized registry (which then requires an operator and becomes a single point of control and failure) but rather a standard for publication that would allow for a decentralized implementation.
Tags: sopa pipa copyright
Thursday, January 26, 2012
Apple Is Slow Boiling Developers
How do you boil a frog? Slowly. Apparently the same is true for endusers and even software developers. That at least is what Apple seems to believe. And while this has been debunked for frogs (they do jump out as the water gets too warm), it’s not clear that the same is true for humans. We seem all too willing to trade off having a shiny device for accepting ever more restrictions on what we can do with that device.
I wonder how long it will take before people realize how much they are losing when instead of a general purpose computer they have a locked down device controlled by a central choke point. I am especially curious when developers like Marco will conclude that this is no longer in their interest. And I am fascinated to see Gruber write a long post arguing that Apple’s new ebook “standard” is not a classic case of embrace, extend and extinguish. What line of control does Apple have to cross for him to say it’s actually a step too far?
The latest tightening of control by Apple is making some APIs accessible only to applications sold through their store. I am not talking about apps for the iPhone or iPad here but applications for laptops and the Mac Mini. You can read more about it here. This whole direction is rather upsetting because I really like my MacBook. But I don’t enjoy being boiled, not even slowly.
Tags: apple control general_purpose_computing
Wednesday, January 25, 2012
Supermodularity And Service Bundling
This will be a bit of a wonky and short post with a longer and less technical one to follow some time soon. Google has just announced a coming update to their privacy policy which will essentially make it possible for Google to integrate all the information it has about a user across its many different services. This comes at the same time as the revelation that Larry Page apparently explicitly stated the goal of building “a single unified, ‘beautiful’ product across everything.”
While one can come up with many possible verbal explanations for why Google might want to go this direction, there is some powerful math that lies at the heart of it: supermodularity. Here is the definition:
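A function $f: \mathbb{R}^k \to \mathbb{R}$ is supermodular if
$$f(x \vee y) + f(x \wedge y) \ge f(x) + f(y)$$
for all $x, y \in \mathbb{R}^k$, where $x \vee y$ denotes the componentwise maximum and $x \wedge y$ the componentwise minimum of $x$ and $y$.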
If a production function is supermodular then x and y are strongly complementary. If you want to read the bible on this consult Don Topkis’s “Supermodularity and Complementarity.”
A firm such as Google for which the production function relies almost exclusively on information (yes, there are servers and people as well) will exhibit supermodularity almost by definition. Why? Because if X and Y are different information vectors, then as long as they carry some joint signal, the inequality will be met as you can always choose to discard additional information (meaning you always have access to the componentwise minimum). In plain English: if you have access to both the search history (X) and the social graph (Y) of a user, you can always “do better” than two separate services that only have access to one of these respectively.
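For readers who prefer to see the inequality at work, here is a small Python sketch that checks it by brute force on a grid of points. The two example functions are my own illustrations and not taken from Topkis: a product, where the inputs reinforce each other, and a maximum, where either input alone is enough.

```python
from itertools import product

def is_supermodular_on(f, points):
    """Check f(max(x, y)) + f(min(x, y)) >= f(x) + f(y) for all pairs of points."""
    for x, y in product(points, repeat=2):
        join = tuple(max(a, b) for a, b in zip(x, y))   # componentwise maximum
        meet = tuple(min(a, b) for a, b in zip(x, y))   # componentwise minimum
        if f(join) + f(meet) < f(x) + f(y):
            return False
    return True

grid = list(product(range(4), repeat=2))   # all integer points (0..3, 0..3)

def complements(v):
    return v[0] * v[1]        # value only arises when both inputs are present

def substitutes(v):
    return max(v[0], v[1])    # either input alone already gives the value

print(is_supermodular_on(complements, grid))
print(is_supermodular_on(substitutes, grid))
```

The product passes the check while the maximum fails it, for instance at x = (1, 0) and y = (0, 1). That is the difference between inputs that are complements and inputs that are substitutes.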
Tags: wonky economics google
Tuesday, January 24, 2012
Tech Tuesday: DNS
Today we are continuing with the web cycle that I outlined two weeks ago. After a URL has been parsed in Step 1, the browser needs to determine the IP address for the domain as Step 2. Reprising the previous example, let’s consider the domain name dailylit.com. How does the browser determine that in order to retrieve information from this domain it should access a server at IP address 72.32.133.224? This is accomplished via a system called DNS, which stands for Domain Name System, and is essentially the equivalent of a telephone book that lists IP addresses (telephone numbers) for domain names (people names).
In ARPANET, the predecessor to the Internet, there were so few domain names that this telephone book was simply a file called HOSTS.TXT that was retrieved from a computer at SRI and stored locally. There were only a few domains (mostly universities) and the file was relatively short. Today on the Internet there are over 200 million domain names of the type dailylit.com, which are further subdivided through subdomains such as blog.dailylit.com. So the idea of having every computer maintain a complete and up-to-date copy of the telephone book locally doesn’t make sense any more.
Thankfully in the early 1980s, which depending on your perspective is either ancient pre-history or not that long ago, DNS was born as a service that would allow the registration of domain names and maintain a mapping between the names and IP addresses in a robust fashion. In fact, without DNS it would be hard to imagine the Internet having grown as dramatically and we probably wouldn’t have nearly as many domains to begin with.
There are many ingenious ideas in the design of DNS and I won’t be able to cover them all here. Instead, I will focus on some key concepts. The first and central one is that there is a hierarchy of authority which allows for the delegation of both registration of domain names and the lookup of IP addresses. The hierarchy starts with the 13 root servers which together make up the so-called root zone from which all authority flows. It is here that the so-called Top Level Domains or TLDs get resolved. Going back to blog.dailylit.com, the TLD is the “.com” part. You can think of a domain name like nested Russian dolls, where the outermost doll, the TLD, is the rightmost part of the name.
The most common TLDs are .com and .net which together account for about half of all domain names. There is of course also .org, .gov, .edu and an ever increasing number of other TLDs such as most recently .xxx. And then there are TLDs for countries which all consist of two letters, such as .uk for the UK (duh) or .ly for Libya, popularized by bit.ly, and .us for the US, which made the domain del.icio.us possible. Each TLD has one or more registrars associated with it who are in charge of letting people and companies reserve names in that domain.
The root servers point to name servers for each of these TLDs. Since blog.dailylit.com is in the .com domain the next place to look is the .com name servers. The .com name servers in turn point to the name servers for dailylit.com itself. Currently those name servers are at Rackspace. Since Susan and I registered and control dailylit.com, we are the ones who get to decide which nameservers should be queried to find the IP address for dailylit.com and its subdomains, such as blog.dailylit.com. The way this generally happens is by logging into a system run by a registrar and setting which nameservers are to be the authoritative sources of IP addresses for the dailylit.com domain. That then gets recorded in the nameserver for the corresponding TLD.
The lookup process that started with the root and went on to the .com TLD is now at the dailylit.com nameservers at Rackspace. They in turn contain information on dailylit.com itself and its subdomains, such as blog.dailylit.com. The whole process of starting at the root and working towards the subdomain (right to left) in a series of separate lookups across different servers is called a “recursive lookup.” If this sounds complicated to you, that’s because it is. It is so complicated and resource intensive that we don’t want the web browser to have to do this each time it encounters a domain name. It would not only be slow, but it would also swamp the root servers, the TLD servers and possibly even the name servers for dailylit itself.
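The chain of lookups can be mimicked with a few dictionaries in Python. Nothing here talks to a real name server. The server labels are invented, the address for dailylit.com is the one mentioned earlier, and the address for blog.dailylit.com is a placeholder from the range reserved for documentation (192.0.2.0/24).

```python
# Hypothetical, hard-coded stand-ins for real name servers.
ROOT = {"com": "com-servers"}
TLD_SERVERS = {"com-servers": {"dailylit.com": "dailylit-nameservers"}}
AUTHORITATIVE = {
    "dailylit-nameservers": {
        "dailylit.com": "72.32.133.224",
        "blog.dailylit.com": "192.0.2.10",   # placeholder address, not the real one
    }
}

def resolve(name):
    """Walk from the root to the authoritative servers, recording each stop."""
    labels = name.split(".")
    tld = labels[-1]                          # rightmost part, e.g. "com"
    registered = ".".join(labels[-2:])        # e.g. "dailylit.com"
    steps = ["root"]
    tld_server = ROOT[tld]                    # root points to the TLD servers
    steps.append(tld_server)
    authoritative = TLD_SERVERS[tld_server][registered]   # TLD points to the domain's servers
    steps.append(authoritative)
    return AUTHORITATIVE[authoritative][name], steps

address, steps = resolve("dailylit.com")
print(address, "via", " -> ".join(steps))
```

Each stop only knows who to ask next, which is what allows the authority over names to be delegated down the hierarchy.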
So instead of doing a recursive lookup every time, the results of these lookups are stored on so-called DNS cache servers. For instance, most ISPs through which you access the Internet will operate their own cache servers. After they have looked up blog.dailylit.com once, these servers will “cache” (meaning temporarily store) the result of the lookup, thus providing a much faster lookup the next time. In fact, your own computer will often cache the results of lookups locally for super fast access. This is important because even a single web page generally involves multiple requests (e.g. for images) to the same server. The duration for which the results of a recursive lookup can be cached locally is known as the Time To Live or TTL and is controlled by the owner of the domain (and generally honored by the cache servers).
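A cache that honors a TTL takes only a few lines of Python. This is a sketch under simplifying assumptions: the class name is made up, the "slow" lookup is a stand-in function that always returns the same address with a TTL of 300 seconds (an arbitrary choice), and time is simulated so that the example runs instantly.

```python
class DnsCache:
    """Remember lookup results until their time to live (TTL) has passed."""

    def __init__(self, lookup, clock):
        self._lookup = lookup    # function: name -> (address, ttl_seconds)
        self._clock = clock      # function returning the current time in seconds
        self._entries = {}       # name -> (address, expiry time)
        self.misses = 0          # how often the slow lookup had to be done

    def resolve(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry and entry[1] > now:
            return entry[0]                      # still fresh, answer from the cache
        self.misses += 1
        address, ttl = self._lookup(name)        # the expensive recursive lookup
        self._entries[name] = (address, now + ttl)
        return address

current_time = [0]                               # simulated clock, in seconds
cache = DnsCache(lambda name: ("72.32.133.224", 300), lambda: current_time[0])

cache.resolve("dailylit.com")                    # first access: full lookup
cache.resolve("dailylit.com")                    # second access: served from cache
misses_while_fresh = cache.misses

current_time[0] = 301                            # the TTL has now expired
cache.resolve("dailylit.com")                    # full lookup again
misses_after_expiry = cache.misses
print(misses_while_fresh, misses_after_expiry)
```

A longer TTL means fewer full lookups but also a longer wait before a changed IP address becomes visible everywhere, which is the trade-off the owner of the domain gets to make.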
The existence of cache servers (sometimes also referred to as non-authoritative servers, although technically not exactly the same) creates a critical security vulnerability for DNS. Let’s say you have gone to your favorite coffee shop and logged on to the Wi-Fi network there. Where do your domain lookups go? Well, to the cache server of whatever ISP the coffee shop uses or possibly even cache servers on the coffee shop’s own network. An attacker with access to those local cache servers could insert falsified records that could have the effect of say pointing chase.com to some rogue server that wants to steal your bank username and password. This would allow for a so-called man-in-the-middle attack (more on this in a future post). Fortunately, some security additions to DNS known as DNSSEC will in the future prevent these kinds of attacks. As more and more of our access to the Internet is over wireless networks this becomes particularly important.
If you made it this far, I hope you have a (newfound) appreciation for the complexity of a system that is used billions of times per day behind the scenes of nearly every access to the Internet. In addition to the technical issues there are also important political issues surrounding DNS. Most recently the proposed SOPA and PIPA legislation would have mandated nameserver operators to make changes that would have interfered with the implementation of DNSSEC. Then there is also the question as to who really controls the root zone which turns out to be the US Department of Commerce. Yes, for the *entire* Internet, which is all the more reason why we should make DNS better not worse.
Tags: tech_tuesday web dns