Last year’s Tech Tuesday ended with a first look at computational complexity. We saw how crazily long a brute-force attempt to find a 40-character string would take. This was the result of exponential growth in the runtime of our algorithm. Today we will start to dig a bit deeper.
Let’s consider a very common problem: checking whether a word is part of a group of words. For instance, if you use the “find on page” feature in your browser, you can check whether this page contains the word “freezing” (for historical context: it is super cold in New York today). Let’s start with the simplest possible approach: look at the words one by one until we have either found the word or gone through all of them. The best case scenario here is that the word is the first word on the page; in that case we need only a single comparison. The worst case is that the word is not on the page at all and we need to examine every single word before figuring that out.
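To make this concrete, here is a minimal sketch of that one-by-one scan in Python; the function name and the tiny word list are just illustrations, and the comparison counter makes the best and worst cases easy to see.

```python
def find_word(words, target):
    """Scan the words one by one; return (found, number_of_comparisons)."""
    comparisons = 0
    for word in words:
        comparisons += 1
        if word == target:
            return True, comparisons   # found it; best case is the very first word
    return False, comparisons          # not found: we examined every word on the page

# A tiny illustrative "page"
page = "it is super cold in new york today".split()
print(find_word(page, "it"))        # (True, 1)  -> best case: a single comparison
print(find_word(page, "freezing"))  # (False, 8) -> worst case: all 8 words compared
```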
Let’s formalize this a bit. Assume that there are n words on the page. Then our best case complexity is 1 and our worst case complexity is n. So our best case is a constant and our worst case is linear. Both of these are useful to know, but we might also be interested in the average case complexity. There is no simple, obvious answer for that, though, as it depends on the probability distribution over the words people might look for.
For instance, consider the case where 50% of the time the word is not on the page at all and 50% of the time it is at a position chosen uniformly at random between the first word and the last word. When the word is absent we make n comparisons, and when it is at a uniformly random position we make (1 + n) / 2 comparisons on average, so overall our average complexity is 0.5 * n + 0.5 * (1 + n) / 2 = 0.75 * n + 0.25. In fact, a realistic average-case analysis of this algorithm would be even more complicated, because we would need to consider not just the distribution of the word that someone is looking for but also the different possible groups of words on the page.
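A quick simulation (just a sketch, using the simple 50/50 assumption above) lands right on that formula:

```python
import random

def average_comparisons(n, trials=100_000):
    """Estimate the average number of comparisons when the target is absent
    half the time and otherwise sits at a uniformly random position."""
    total = 0
    for _ in range(trials):
        if random.random() < 0.5:
            total += n                     # absent: all n words get examined
        else:
            total += random.randint(1, n)  # present: position is uniform over 1..n
    return total / trials

n = 1000
print(average_comparisons(n))  # empirically close to 750.25
print(0.75 * n + 0.25)         # the formula gives exactly 750.25
```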
Now, I conveniently glossed over an important aspect of this problem: we can’t really compare two words in a single operation. Instead, we need to compare them one character at a time. So let’s say that the word we are looking for has k characters. Then to compare it with one of the words on the page we need to check anywhere from 1 character (best case: the very first characters already differ) up to k characters (worst case: either the words are the same or only the last characters differ). So to get the right complexity for our algorithm we need to multiply the earlier numbers by a factor between 1 and k, roughly k/2 if mismatches are equally likely at every position.
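Here is a sketch of that character-by-character comparison, again with a counter so the best and worst cases are visible (the example words are made up):

```python
def words_equal(a, b):
    """Compare two words character by character; return (equal, characters_checked)."""
    checked = 0
    for ca, cb in zip(a, b):
        checked += 1
        if ca != cb:
            return False, checked     # best case: the very first characters differ
    # every overlapping character matched; the words are equal only if lengths match too
    return len(a) == len(b), checked  # worst case: all k characters checked

print(words_equal("freezing", "fleecing"))  # (False, 2) -> differ at the 2nd character
print(words_equal("freezing", "freezing"))  # (True, 8)  -> all k = 8 characters checked
```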
And here is another important wrinkle: are we looking for a word just once, or many times in the same group of words? If the latter, then it turns out we can do a bunch of work ahead of time on the group of words to massively speed up the subsequent lookups. In fact, as we will see, we can make each lookup roughly linear in k (that’s the number of letters in the word we are looking for) and, shockingly, independent of n, the number of words in our group!
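As a teaser, one common form of that ahead-of-time work is to put the words into a hash-based set (Python’s built-in set is one such structure). The sketch below just reuses the tiny word list from earlier; how and why this works is a topic for a later post.

```python
# One-time preparation: hash every word on the page into a set.
page = "it is super cold in new york today".split()
word_set = set(page)

# Each later lookup hashes the k characters of the query and probes the set,
# so (in expectation) its cost depends on k but not on n, the number of words stored.
print("freezing" in word_set)  # False
print("cold" in word_set)      # True
```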
The takeaway from the end of last year and today should be that this is a very rich set of questions to examine. Over the coming weeks we will look more deeply at many of them. For example: just how much worse can the worst case be than the best case (hint: a lot)? What is the asymptotic behavior of an algorithm as n grows large? How do we amortize initial preparation effort over subsequent use? Are there problems that are intrinsically hard, or is it always just a question of finding a better algorithm?