The last few Tech Tuesdays have all been about computational complexity. One of the problems we looked at a couple of times was finding a match (eg a word on a page), and for good reason: matching is a very common real world task. One important practical consideration that I mentioned briefly before is: are we doing a one-off match or will we run many matches against the same data?
Why is this important? Because if we know that we will do many matches or lookups, we can do some or a lot of work upfront in return for much faster subsequent performance. Today we will make that a lot more concrete. Consider the problem of finding whether a word of length k is contained in a text of n words. A brute force approach would compare one word at a time, giving a running time that grows as O(k * n), which, if we treat k as relatively small, is O(n).
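To make the brute force version concrete, here is a minimal Python sketch (the function name and the word-list input are just illustrative): it compares the query against one word of the text at a time.

```python
# Brute force: scan every word in the text and compare it to the query.
# Comparing two words of length k costs O(k), so scanning n words is O(k * n).
def brute_force_contains(text_words, query):
    for word in text_words:
        if word == query:  # each comparison costs up to O(k)
            return True
    return False
```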
But there is an alternative. We could first process the page and build a tree structure as follows: starting at the root node, there is a branch for every letter that a word in the text starts with. For instance, if we do this for the preceding paragraph there would be links from the root for w, i, t, b, k, d, m, o, l, c, s, a, r, f, p.
Then at the second layer, there is a branch for every subsequent letter that actually occurs. So for instance from the w node there would be links for h, e, i, o, a. And so on, with special terminal nodes marking the ends of words (eg root -> w -> e -> end; if the text also contained “week”, the path root -> w -> e -> e … would exist as well, hence the need to mark the ends of words explicitly).
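Here is a minimal Python sketch of one way to build such a tree (this structure is commonly called a trie), using nested dictionaries; the build_trie name and the END marker are illustrative choices, not the only way to do it.

```python
# Build the tree (trie) as nested dictionaries. A special end marker on a
# node records that a complete word terminates there, which is how "we"
# stays distinguishable from "week".
END = "$end$"  # illustrative marker; any key that cannot collide with a letter works

def build_trie(text_words):
    root = {}
    for word in text_words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})  # follow or create the branch
        node[END] = True  # mark that a word ends at this node
    return root
```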
Once we have built this tree we can look up words in O(k), or for known small k effectively O(1)! That is a huge performance gain. But of course we bought this gain at the cost of building the tree. How big is that cost? Well, it is somewhere around O(k * n). So if we are only ever doing a single lookup we have gained absolutely nothing.
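And here is a matching lookup sketch over that same dictionary-based tree: it follows one branch per letter of the query, so the work depends only on the word's length k, not on the number of words n in the text.

```python
# Look up a word by walking one branch per letter: O(k) steps total.
def trie_contains(root, query):
    node = root
    for letter in query:
        if letter not in node:
            return False  # no branch for this letter, so the word is absent
        node = node[letter]
    return END in node  # only a real word if the end marker is present
```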
But now consider the case where we do m lookups and m grows large. We get to “amortize” the initial cost over every lookup. Let’s assume again for a moment that k is small / constant. The O(n) setup cost, spread over m lookups, works out to O(n/m) per lookup. So if our number of lookups m grows as fast as or faster than n (our number of words), the setup adds only a constant to the cost of each lookup.
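As a rough back-of-the-envelope illustration (the numbers below are made up for the comparison, not measurements), here is how the total work stacks up with and without the upfront build:

```python
# Total work for m lookups, treating k as a small constant.
n = 10_000   # words in the text
m = 10_000   # number of lookups
k = 5        # typical word length

brute_force_total = m * k * n   # every lookup rescans the whole text
trie_total = k * n + m * k      # build the tree once, then O(k) per lookup

print(brute_force_total)  # 500,000,000 "steps"
print(trie_total)         # 100,000 "steps": the build cost amortizes away
```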
This kind of analysis will frequently show that indexing (ie building some kind of lookup structure) is an incredibly powerful way of speeding up subsequent lookups. Many real world performance problems can be solved that way. Next week, though, we will look at some fundamental barriers to how fast things can go.