Tech Tuesday: Taking a Break

With last week’s post on P = NP?, Tech Tuesday will take an indefinite break. First, we are heading to Africa for the rest of March (our longest vacation in 20+ years). Second, when we come back I will be using Continuations to significantly elaborate on the ideas brought up first in my post on Entering the Information Age and then in my talk at DLD on The Big Questions about the Future. I am looking forward to this new phase of Continuations and promise to return with renewed energy.

Posted: 11th March 2014
Tags:  tech tuesday break

Tech Tuesday: P = NP?

Last Tech Tuesday, we looked at complexity classes, which group problems by their intrinsic difficulty. There we encountered one of the great unsolved questions of Computer Science: Is P = NP? The class P is easy to define. It is the set of all problems that can be solved by a Turing machine in polynomial time, i.e. O(n^k) for some k. So you might think that NP stands for non-polynomial time, but you’d be wrong. Instead, the class NP has the same definition as P except for a non-deterministic Turing machine.

You may recall from the posts on Turing machines that for each state of the machine and symbol on the tape there is exactly one action to be taken (which results in a new state and possibly a symbol written to the tape and/or movement of the tape head). In a non-deterministic Turing machine there are multiple possible actions and the machine is assumed to be taking each of them simultaneously!

Now it is easy to see why such a machine could be very fast. Take the example that we used to motivate the study of complexity: guessing a string drawn from the letters A, G, C, T. We saw that at length 40 this would take 400,000 years for my MacBook. But a non-deterministic Turing machine could do it very quickly because it can essentially guess A, G, C, T simultaneously for each position. The running time for this machine would be O(n) for n being the number of letters in the string.

But wait: when I introduced Turing machines, I made the claim that they are the most powerful model of computation. This non-deterministic Turing machine sure sounds more powerful. Yet there is nothing that a non-deterministic Turing machine can compute that cannot also be computed by a normal one. How do we know that? Because we have a proof that shows that a non-deterministic Turing machine can always be simulated by a (normal) Turing machine.

Now my description of using a non-deterministic Turing machine already contained the seed for a different definition of the class NP. It also happens to be the class of problems where we can verify a positive result in polynomial time. Consider the following problem. Given a set of integers, e.g. {-17, 2, 3, 12, -5, 21}, is there a subset that adds up to 0? To answer this question an algorithm might proceed as follows: while there are subsets left, pick a subset and test if it adds up to zero; if it does, report “yes” — if we exhaust all subsets, report “no”. Now for a set with n integers in it, there happen to be 2^n subsets, so our loop might run for a very long time as n gets large.
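To make that loop concrete, here is a minimal brute force sketch in Python (the function name and the use of itertools are my own illustration, not part of the original post):

```python
from itertools import combinations

def has_zero_subset(numbers):
    """Brute force: try every non-empty subset, of which there are 2^n - 1."""
    for size in range(1, len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == 0:
                return True, subset   # found a witness, report "yes"
    return False, None                # exhausted all subsets, report "no"

print(has_zero_subset([-17, 2, 3, 12, -5, 21]))  # (True, (2, 3, -5))
```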

If we focus on just the verification bit for a moment though, it is trivial. To verify if a subset adds up to zero, all we need to do is add up the numbers in that subset. The biggest possible subset has n elements (the set itself) and we clearly see that the verification step is linear in n, i.e. O(n). We call a subset that adds up to 0 a proof or witness because it shows that the answer to the question is yes. We see here a fundamental asymmetry. In order to answer yes, all I need to do is supply a single witness. To answer no, I need to show that no such witness exists!
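The verifier on its own is almost embarrassingly short (again, the names are mine, purely for illustration):

```python
def verify_witness(subset):
    """O(n) in the size of the subset: just add it up."""
    return sum(subset) == 0

print(verify_witness([2, 3, -5]))   # True: one witness settles the "yes" answer
print(verify_witness([-17, 21]))    # False: rejecting one candidate settles nothing
```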

We also see how this maps nicely to our distinction between normal and non-deterministic Turing machines. The normal one needs to crank through our code as outlined above. The non-deterministic one gets to work on all the subsets simultaneously. A great many problems have this structure. For instance, if I give you an integer n and a second, smaller integer f, you can easily verify whether f is a factor of n. But if n is large, there are many candidate numbers. On DuckDuckGo you can see a long list of such problems.
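The factoring example has the same shape: checking a candidate is one modulo operation, while finding a factor by trial division may take many attempts. A rough sketch, not how real factoring or primality code works:

```python
def is_factor(f, n):
    """Verification: a single division answers the question."""
    return n % f == 0

def find_factor(n):
    """Search: trial division may try on the order of sqrt(n) candidates."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f
        f += 1
    return None   # no factor found: n is prime

print(is_factor(89, 8633))   # True
print(find_factor(8633))     # 89, but only after trying every smaller candidate
```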

So now back to our original question. Is P = NP? Based on everything we have just discussed it would intuitively appear that NP should contain at least some harder problems that are not in P. That is how the complexity class diagram from last week is drawn. But we don’t know that with certainty. To date there is no proof and some people doubt there will ever be one. In the absence of a proof there is some possibility that for every problem in NP we eventually find an algorithm that executes in polynomial time on a normal Turing machine.

This turns out to be a question with potentially great practical implications, especially in the area of cryptography. Public key cryptography relies on the asymmetry in which it is easy to verify that a message was encrypted (or signed) using someone’s private key when you have their public key. But all of that would be meaningless if you could easily derive someone’s private key from their public one (this happens to be related to the factoring problem I mentioned above). Ditto for bitcoin. Finding a number to complete a block is hard. Verifying (by hashing) that the number does complete the block is easy. Again, it is hard to overstate just how many problems fall into this category.

Posted: 4th March 2014
Tags:  tech tuesday theory complexity p vs np

Tech Tuesday: Complexity Classes

Over the last few weeks in Tech Tuesday we have looked at various aspects of computational complexity. So far, whenever we have looked at a sample problem, such as finding words on a web page, we have studied the characteristics of a particular algorithm for solving the problem (in many cases starting out by looking at a brute force solution). But we have skirted the central question of computational complexity: how hard is the problem? Or asked differently, for a given problem, what is the best we can do?

Much of the really hard theoretical work that has gone on in computational complexity centers around assigning problems to complexity classes. A class is defined roughly as:

the set of problems that can be solved by an abstract machine M using O(f(n)) of time or space, where n is the size of the input

where I have linked to the previous posts describing the relevant concepts. The basic idea is that we represent computation by abstract machines (which are analyzable) and then determine how much time or space is required by the best possible solution on that machine. While this is easy to state, finding best solutions and proving that they are best is hard. As we will see in just a second it turns out to harbor one of the great unsolved problems of computer science.

This definition of complexity classes gives rise to a hierarchy which is shown in the following picture (taken from a Penn State class on the theory of computation):

[Image: diagram of the complexity class hierarchy, taken from a Penn State class on the theory of computation]

What does this mean? Let’s start at the outermost class of all recognizable problems, also known as decision problems. In our study of the halting problem, we saw that some problems are simply undecidable. That’s why in this chart the class of decidable problems is contained strictly inside the class of all problems. This is something we know and that has been proven.

Now you see lots more complexity classes nested inside of the decidable problems, such as EXPSPACE, which is the set of all problems decidable by a Turing Machine in O(2^p(n)) tape space where p(n) is a polynomial of n (which is the size of the input). Now the diagram shows this as being strictly larger than EXPTIME, which is the set of problems decidable in exponential time. That, however, so far is only a conjecture. We believe that there are problems that are in EXPSPACE but not in EXPTIME but we don’t have a proof of that.

We do, however, know that EXPSPACE is a strict superset of PSPACE. By now you should be able to guess what the definition of PSPACE looks like: it is the set of problems decidable by a Turing machine in O(p(n)) of space.

Here is something else we know and have already encountered in Tech Tuesday. We saw that the class of regular languages, which can be recognized using finite state machines, is strictly smaller than the class CFL of context-free languages.

So we have a fair bit of knowledge at the outermost layers of this hierarchy and also at the innermost. But there is a huge gaping hole in the middle where a great many problems reside. In particular you will almost certainly have heard about the question whether or not P = NP. This is considered to be the central unsolved question in computer science. We will take this question up next Tuesday!

PS There are many more complexity classes than are shown in the diagram above. You can find a comprehensive listing in the Complexity Zoo.

Posted: 25th February 2014
Tags:  tech tuesday complexity

Tech Tuesday: Amortization

The last few Tech Tuesdays have all been about computational complexity. One of the problems that we looked at a couple of times was finding a match (e.g. a word on a page). And for good reason, as matching is a very common real world task. One important practical consideration that I mentioned briefly before is: are we doing a one-off match or will we run many matches against the same data?

Why is this important? Because if we know that we will do many matches or lookups we can do some or a lot of work upfront in return for way faster subsequent performance. Today we will make that a lot more concrete. Consider the problem of finding whether a word of length k is contained in a text of n words. A brute force approach would compare one word at a time and would result in growth of O(k * n), which if we think k is always relatively small would be O(n).

But there is an alternative. We could first process the page and build a tree structure as follows: starting at the root node, there is a branch for every letter that a word in the text starts with. For instance, if we do this for the preceding paragraph there would be links from the root for w, i, t, b, k, d, m, o, l, c, s, a, r, f, p.

Then at the second layer, there is a branch for every subsequent letter that actually occurs. So for instance from the w node there would be links for h, e, i, o, a. And so on, with special terminal nodes marking the ends of words (e.g. from root -> w -> e -> end — if the text contained “week” also, the following path would also exist: root -> w -> e -> e … hence the need to mark the end of words explicitly).

Once we have built this tree we can look up words in O(k), or for known small k it would be O(1)! That is a huge performance gain. But of course we bought this gain for the cost of building the tree. How big is that cost? Well, it is somewhere around O(k * n). So if we are only ever doing a single lookup we have gained absolutely nothing.
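Here is a minimal sketch of this structure in Python, using nested dictionaries as the tree and a special key to mark word endings (the names and the dict representation are my own illustration):

```python
END = "$"  # special key marking "a word ends here"

def build_trie(words):
    """One-time setup cost: roughly O(k * n) for n words of length up to k."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = True
    return root

def lookup(trie, word):
    """Each lookup is O(k): follow at most one branch per letter."""
    node = trie
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return END in node

text = "we will do many matches or lookups against the same data".split()
trie = build_trie(text)
print(lookup(trie, "matches"), lookup(trie, "match"))  # True False
```

Using nested dicts keeps the sketch short; a real implementation might use explicit node objects or arrays instead. Note how the end-of-word marker is what lets "match" correctly come back as not found even though it is a prefix of "matches".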

But now consider the case where we do m lookups where m grows large. We get to “amortize” the initial cost over every lookup. Let’s assume again for a moment that k is small / constant. If our number of lookups m grows as fast or faster than n (our number of words), then the setup cost adds only a constant to our cost.
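To see the amortization at work, here is a rough back-of-the-envelope comparison using the idealized operation counts from the analysis above (these are formulas, not measurements):

```python
def total_ops_scan(n, k, m):
    """m lookups, each rescanning all n words of length ~k."""
    return m * k * n

def total_ops_trie(n, k, m):
    """Build the trie once (~k * n), then each lookup costs ~k."""
    return k * n + m * k

n, k = 10_000, 8  # 10,000 words of roughly 8 letters each
for m in (1, 100, 10_000, 1_000_000):
    print(f"m={m:>9,}  scan={total_ops_scan(n, k, m):>15,}  trie={total_ops_trie(n, k, m):>12,}")
```

At m = 1 the two are essentially tied, which is the "gained absolutely nothing" case above; by the time m reaches n the scan is thousands of times more work.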

This kind of analysis will frequently show that indexing (i.e. building some kind of lookup structure) is an incredibly powerful way of speeding up subsequent lookups. Many real world performance problems can be solved that way. Next week, though, we will look at some fundamental barriers to how fast things can go.

Posted: 4th February 2014
Tags:  tech tuesday complexity performance

Tech Tuesday: Growth Behavior (Big O Notation)

Today’s Tech Tuesday will let you in on a bit of computer science secret language. Sometimes when discussing a solution to a problem an engineer will say something like “that’s too slow — it’s O of n squared” (that’s the letter “O” not a zero). What does it mean? Well it is a shorthand way of describing the growth behavior in the time or space required by an algorithm as n (the number of things we are processing) grows large. More formally it is known as “Big O Notation.”

As we have seen in the last few posts on computational complexity, for small n the differences between algorithms may be relatively small but that can change dramatically as n grows. The mysterious sounding Big O notation is a relatively simple way of formalizing the idea that the relevant growth behavior can be captured entirely by its dominant term. Let me explain using an example.

Suppose algorithm A takes 1,000 steps (a constant) and algorithm B takes n + 20 steps. For small n, say n = 20, B will be much faster than A. But for any n > 980, algorithm B will be slower and increasingly so. For n = 1 million, B is already one thousand times slower than A. What Big O Notation does is give us a way of expressing that A and B belong to different growth classes or growth orders (hence the “O” as in order). In this example, A belongs to O(1) which is constant “growth” and B belongs to O(n) which is linear growth.

We write O(1) and O(n) as if they were functions themselves but they are better thought of as sets of all functions that share the same growth order. For instance, n + 20, 3n - 5 and 17n all share the same growth order O(n). Wait, what? 17n would seem to grow a lot faster than n + 20, no? 17 times as fast to be precise. But this is not what matters. Why? Because it still grows way more slowly than, say, O(n ^ 2) as n grows large (that’s n squared).

One relatively easy way to think about the difference is that if your computer gets 17 times faster, then an algorithm that used to take 17n steps becomes as fast as one that took n steps for any value of n. But if your computer gets 17 times faster, then n ^ 2 / 17 will still be way more steps than n for large n. This is captured in the formal definition of Big O notation, which says roughly that f(n) ∈ O(g(n)) if there exist two positive values M and N such that for every n > N, f(n) ≤ M * g(n).

So what is the growth order of an algorithm that takes n^2 + 20n + 1000 steps? Well, it turns out to be O(n^2). It is pretty easy to see why. For any n > 20, the n^2 term exceeds the 20n term. At n = 20 the 20n + 1000 adds up to 1,400 while n^2 is only 400, but the lower order terms fall behind quickly from there. So setting M = 4 and N = 21 does the trick: you can check that for every n > 21, n^2 + 20n + 1000 ≤ 4n^2.
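A throwaway script makes that bound easy to check (this is just my quick verification, not part of the original post):

```python
# f(n) = n^2 + 20n + 1000 versus the bound M * g(n) = 4 * n^2
failures = [n for n in range(1, 1001) if n * n + 20 * n + 1000 > 4 * n * n]
print(failures)  # the small n where the bound does not yet hold: 1 through 21
```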

That’s why I said earlier that what Big O notation does is reduce everything to the dominant term. The n^2 is the dominant term and the n and the constant don’t really matter as n gets large. There are many different growth orders and we will encounter a variety of them as Tech Tuesday goes on. While understanding order of growth is super important, you should keep in mind  that for some practical problems that you may encounter n may in fact be known to be small and always small. In that case things like large constants may matter (more on that soon).

Posted: 28th January 2014
Tags:  tech tuesday complexity

Tech Tuesday: Best, Average and Worst Case

In the last two Tech Tuesdays, I introduced the topic of computational complexity. Today as promised I want to dig a bit deeper into the question of best, average and worst case complexity. The canonical example used for this is sorting. For simplicity we will assume that we are sorting a list of numbers. Several different sorting algorithms have been developed over the years that go by fun names such as Insertion Sort, Quicksort and even Bubble Sort (here is a past Tech Tuesday three-part introduction to algorithms).

On the wonderful Internet someone has created an amazing web site that illustrates the different sorting algorithms at work. You should head over there now and check it out. You will notice that the site distinguishes between random data, nearly sorted data, reversed data and data with few uniques (meaning there are not that many distinct values). So what’s the best algorithm?

There is no single dominant sorting algorithm and that’s because different algorithms have different best, average and worst case characteristics. For instance, a very widely used sorting algorithm is Quicksort, which takes a “divide and conquer” approach. The basic idea is to pick a pivot element and then create two sublists — one with all the elements greater than the pivot and one with all the elements smaller. The pivot sits (by choice) right between those two sublists. Each of these lists is smaller than the original list and we can just apply Quicksort to them until we encounter lists of size 1, which are always sorted.
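Here is a compact sketch of that idea in Python (a teaching version, not a production sort; real library routines are more careful about pivot choice and memory):

```python
def quicksort(items):
    """Divide and conquer: a list of 0 or 1 elements is already sorted."""
    if len(items) <= 1:
        return items
    pivot = items[0]                                  # naive pivot choice
    smaller = [x for x in items[1:] if x < pivot]     # elements below the pivot
    larger = [x for x in items[1:] if x >= pivot]     # elements at or above it
    return quicksort(smaller) + [pivot] + quicksort(larger)

print(quicksort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```

Notice the naive choice of the first element as pivot; as discussed below, that kind of choice is exactly what turns an already sorted list into the worst case.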

Best case and average case performance for Quicksort both behave as n log n meaning that if you have n items to sort then the growth rate of the time required is roughly n log n. The worst case performance of Quicksort though is n ^ 2. Now you might be tempted to say n ^ 2 doesn’t seem that much worse than n log n but you would be mistaken. Here is a table that illustrates the difference between the two

[Table: n log n vs. n^2 as the number of items grows, with the gap widening to roughly 20,000x at 100,000 items]

As you can see it starts out innocently enough, but when you get to 100,000 items (which is  not a big number on say the Internet), the worst case is already 20,000 times worse than the average case.
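If you want to reproduce a table like that yourself, a few lines will do. I am using base-10 logarithms here, which is what matches the roughly 20,000x figure at 100,000 items; a different base only shifts the numbers by a constant factor:

```python
from math import log10

for n in (10, 100, 1_000, 10_000, 100_000):
    nlogn = n * log10(n)
    print(f"{n:>7,}  n log n = {nlogn:>11,.0f}   n^2 = {n*n:>14,}   ratio = {n*n/nlogn:>8,.0f}")
```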

Now here is something particularly intriguing. The worst case for some naive implementations of Quicksort turned out to be an already sorted list! Various improvements have been made to Quicksort over the years and there are sorting algorithms, such as Heapsort, that have n log n as their worst case.

I hope this example already makes it clear why understanding the differences between average and worst case performance can matter a lot. There are examples where the worst case is even worse than above. For instance, the simplex algorithm used for solving linear programs has polynomial time average complexity. But people have been able to construct worst cases that have exponential complexity (and in case you missed it, that is bad, bad news). Thanks to Oliver Friedmann, co-founder of Ziggeo, for pointing out the simplex example to me (Oliver has authored many papers in this field with important results).

If you are still not convinced that this is an important matter, you should think about designing systems for the Internet where you do not control the inputs and where you may be facing adversarial opponents. If your worst case complexity grows too fast you may wind up expending way more resources than you had anticipated. Conversely, in cryptography, if you are basing things on the worst case but the average case turns out to be much simpler, you might have a system that is easier to break than you imagined.

Posted: 14th January 2014
Tags:  tech tuesday theory complexity

Tech Tuesday: Computational Complexity (Intro Cont’d)

Last year’s Tech Tuesday ended with a first look at computational complexity.  We saw how crazily long a brute force attempt to find a 40 character string would take. This was the result of exponential growth in the runtime of our algorithm. Today we will start to dig a bit deeper.

Let’s consider a very common problem: checking whether a word is part of a group of words. For instance, if you use the “find on page” feature in your browser you can check if this page contains the word “freezing” (for historical context — it is super cold in New York today). Let’s start with the simplest possible approach: look at the words one by one until we have either found the word or gone through all the words. Now the best case scenario here is that the word is the first word on the page. In that case we need only a single comparison. The worst case is where the word is not on the page at all and we need to examine every single word before figuring that out.
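Here is that brute force loop as a few lines of Python, returning the number of comparisons made so that the best and worst cases are visible (the names are mine, purely for illustration):

```python
def find_word(words, target):
    """Linear scan; returns (found, comparisons_made)."""
    for i, word in enumerate(words, start=1):
        if word == target:
            return True, i           # best case: i == 1
    return False, len(words)         # worst case: we compared against every word

page = "it is super cold in new york today".split()
print(find_word(page, "it"))         # (True, 1)
print(find_word(page, "freezing"))   # (False, 8)
```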

Let’s formalize this a bit. Assume that there are n words on the page. Then our best case complexity is 1 and our worst case complexity is n. So our best case is a constant and our worst case is linear. Both of these are useful to know but we might also be interested in average case complexity. There is no simple obvious answer though for that as it depends on the probability distribution over possible words people might look for.

For instance, consider the case where 50% of the time the word is not on the page at all and 50% of the time it is in a random position uniformly between the first word and the last word. Then our average complexity looks something like this: 0.5 * n + 0.5 * 0.5 * (1 + n) = 0.75 * n + 0.25. In fact, a realistic average case analysis of this algorithm would be even more complicated because we need to consider not just the distribution of the word that someone is looking for but also different possible groups of words on the page.
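Sticking with those simplified assumptions for a moment, a quick simulation confirms the 0.75n + 0.25 figure (just a sanity check, not part of the original post):

```python
import random

def average_comparisons(n, trials=200_000):
    total = 0
    for _ in range(trials):
        if random.random() < 0.5:
            total += n                      # word is not on the page: n comparisons
        else:
            total += random.randint(1, n)   # word at a uniformly random position
    return total / trials

n = 100
print(average_comparisons(n), 0.75 * n + 0.25)  # both come out close to 75.25
```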

Now I conveniently glossed over an important aspect of this problem. We can’t really compare two words in a single operation. Instead, we need to compare them one character at a time. So let’s say that the word we are looking for has k characters. Then to compare it with one of the words on the page we need to check at least 1 character (best case, first character is already different) up to k characters (worst case, either the word is the same or the last character is different). So really to get the right complexity for our algorithm we need to multiply the earlier numbers by k/2.
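Word comparison itself, character by character, looks like this (again just an illustrative sketch):

```python
def compare_words(a, b):
    """Returns (equal, characters_examined)."""
    for i in range(min(len(a), len(b))):
        if a[i] != b[i]:
            return False, i + 1                    # best case: first characters differ
    return len(a) == len(b), min(len(a), len(b))   # worst case: checked every character

print(compare_words("freezing", "friday"))    # (False, 3)
print(compare_words("freezing", "freezing"))  # (True, 8)
```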

And here is another important wrinkle: are we looking for a word just once or many times in the same group of words? If the latter, then it turns out we can do a bunch of work ahead of time on the group of words to massively speed up the subsequent lookup. In fact, as we will see, we can make each lookup linear in k (the number of letters in the word we are looking for) and, shockingly, independent of n, the number of words in our group!

The takeaway from the end of last year and today should be that this is a very rich set of questions to examine. Over the coming weeks we will look more deeply at many of these. Such as: just how much worse can worst case be than best case (hint: a lot)? What is the asymptotic behavior of an algorithm as n grows large? How do we amortize initial preparation effort over subsequent use? Are there problems that are intrinsically hard or is it always just a question of finding a better algorithm?

Posted: 7th January 2014
Tags:  tech tuesday theory complexity

Tech Tuesday: Computational Complexity (Introduction)

So far in the Tech Tuesday series on the theory of computer science we have covered questions of computability. We have looked at ways of formalizing computation through abstract machines such as the Finite State Machine and the Turing Machine to explore limits on what can be computed at all. We saw with the Halting Problem that the power to compute lots of things is gained at the cost of not being able to say much about the computation (e.g. will it stop?).

Now we will turn our attention to a different question known as computational complexity. We will look at algorithms known to solve a problem and ask: how long does this program run? And how much space will it consume? These questions might have seemed more relevant at a time when computers had 48 KB of main memory and CPU speed was about 0.5 MIPS (the specs of the Apple II on which I learned how to program). Today I am writing this blog post on a MacBook Air with 8 GB of memory (a 160,000x increase) and an Intel Core i7 with over 100,000 MIPS (a 200,000x increase). So who cares about the time or space required to execute a program?

To understand why we still care, consider a seemingly simple problem: guessing a string of length 10 drawn from the letters A, G, C, T. Since there are 4 letters that amounts to 4 ^ 10 different possible combinations, which is the same as 2 ^ 20 or about 1 million combinations. Let’s generously assume it takes only 1 instruction to check a combination; then we are looking at 1 million instructions, which my MacBook Air could do in roughly 1/100,000th of a second (even if you make the number of instructions more realistic you will still be in the sub-millisecond range).

Now consider the very same problem except that you are trying to guess a string of length 40. You are looking at 4 ^ 40 or 2 ^ 80 combinations. Since 2 ^ 10 is about 10 ^ 3, this is 10 ^ 24 checks. Again assuming only 1 instruction per check we are now looking at a running time of 10 ^ 24 / 10 ^ 11 = 10 ^ 13 seconds which is about 400,000 years (or as Wolfram Alpha helpfully points out about 20 times since the last glacial maximum). If you are looking for a string just a tad longer than that you can take up more time than we believe the universe has existed.
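The arithmetic is easy to redo yourself; the sketch below uses the same rough assumptions as above (one instruction per check, about 10^11 instructions per second):

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
INSTRUCTIONS_PER_SECOND = 1e11   # roughly 100,000 MIPS, as assumed above

for length in (10, 40, 50):
    checks = 4 ** length                  # A, G, C or T at every position
    seconds = checks / INSTRUCTIONS_PER_SECOND
    print(length, f"{seconds:.2g} seconds", f"= {seconds / SECONDS_PER_YEAR:.2g} years")
    # by length 50 the estimate already dwarfs the age of the universe (~1.4e10 years)
```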

Now if you noticed my choice of letters you will realize that I am talking about DNA. A string of length 40 is tiny in the DNA world where even small organisms such as bacteria have north of 10 ^ 6 base pairs. There are many other naturally occurring phenomena where any brute force approach would result in exponentially growing running time. So despite our tremendous advances in hardware we still care a great deal about the running time and space requirements of an algorithm.

In the upcoming posts we will learn a lot more about the time and space characteristics of different algorithms for the same problem and also different types of problems. Along the way we will discover some great counter-intuitive surprises and find out more about the limits of our knowledge. 

Posted: 17th December 2013
Tags:  tech tuesday theory complexity

Tech Tuesday: Halting Problem (Implications)

Last Tech Tuesday, I introduced the halting problem. As you may recall the upshot was that there is no general algorithm that will determine whether a Turing Machine will stop for a particular input. Now you might at first think that this is not a big deal. After all do we really care if a Turing machine stops or not? As it turns out the implications of this are significant including philosophical limits on what is knowable (to humans) and the practical impossibility of building secure computer systems.

To understand why, let’s recall that when I introduced Turing Machines, I mentioned that “to date we have not found a model of computation that is more powerful than a Turing Machine.” Now let’s turn this on its head and ask whether or not most existing programming languages and computer designs are as powerful as a Turing Machine. The requirements turn out to be super simple: all you really need is conditional branching, i.e. testing a condition and doing one thing if it is met and another if it is not (see my post on control structures) and variables (technically infinitely many variables but we will see more about that in the future). Most languages have that and so most languages are what is called Turing complete.

That in turn implies that the halting problem applies to most programming languages and thus there is no general algorithm to determine whether a program will stop on a particular input or not. But the problem is way worse than that. Seemingly trivial questions such as “will this program output the number 42?” or “will this program ever execute line 42 of its code?” can be reduced to the halting problem. Put differently, if there were a general algorithm to answer these questions then it could be used to solve the halting problem. Since we know that the halting problem cannot be solved, the reduction shows that these problems also cannot be decided with a general algorithm.
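To see the flavor of such a reduction, here is a hedged sketch. Everything in it is hypothetical: would_print_42 is the decider we are pretending exists, and run is a stand-in for executing a program on an input.

```python
def would_print_42(program, input_data):
    """Hypothetical decider: does program(input_data) ever output 42?"""
    ...

def halts(program, input_data):
    """If would_print_42 existed, it would solve the halting problem."""
    def wrapper():
        run(program, input_data)   # run() is a stand-in; assume any 42 that the
        print(42)                  # program prints on its own is suppressed
    # wrapper prints 42 exactly when program halts on input_data, so deciding
    # "does it print 42?" would decide halting, which we know is impossible.
    return would_print_42(wrapper, None)
```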

Putting Turing completeness together with the ability to reduce other problems to the halting problem we are faced with a very unsatisfying situation: there really isn’t much we can say about what programs do! One of the dramatic practical implications of this is that computer security is kind of an impossible task. As long as we allow general programs (meaning programs in a Turing complete language, i.e. pretty much any general purpose language) we can’t make sure that they won’t do something nasty once they execute.

Now let’s briefly consider a possible philosophical implication of the halting problem. The universe itself (and even smaller physical processes inside of it) can be interpreted as carrying out computations. The halting problem’s undecidability then suggests that many questions cannot be decided by a general algorithm either. That in turn puts a question mark behind the belief in the singularity, to the extent that a self-learning machine is still applying a general algorithm.

Posted: 3rd December 2013
Tags:  tech tuesday halting problem

Tech Tuesday: The Halting Problem (Introduction)

The current sequence of Tech Tuesday posts is about the theory of computation. We have already seen that a simple model of computation such as the Finite State Machine is very useful but has severe limitations in that it cannot even decide a language as simple as all palindromes. We then learned about the Turing Machine as the most powerful known model of computation and saw how a Universal Turing Machine establishes the separation of software from hardware. This leads to a natural question: what are the limitations to what a Turing machine can compute?

Let’s start with a seemingly simple question. Does a particular Turing machine stop for a given input, i.e. initial state of the tape? This seems initially trivial. After all a Turing Machine only has a finite set of states and we saw that this made it possible to answer this very question for any Finite State Machine. But a Turing Machine also has an infinite tape and that’s where the problem lies. Remember that the transition function for a Turing Machine depends on its state and the symbol in the current square of the tape.

Even with a relatively trivial machine of half a dozen states or so, the patterns that can develop on the tape as the machine operates can be incredibly complex (and chaotic). It may take years of investigation to come up with a proof as to whether or not a particular machine stops (see for instance the investigation into so called Busy Beaver machines). So that leads to the obvious question: is there a general way of determining for any Turing Machine whether it will stop? This is known as the “halting problem” and turns out to have far reaching implications for the boundaries of human knowledge.

The answer turns out to be no and Turing proved this in his 1936 paper. I am not going to attempt to reproduce the proof here but will suggest the basic idea. The extreme simplicity of Turing Machines lets us enumerate all programs, which means we can assign a number (i) to each program. Similarly every input to a program can be represented by just a number (x). That lets us reduce the halting problem to a hypothetical halting function h(i, x) which is 1 if program i halts on input x and 0 if it doesn’t. So now we need to show that such a function cannot in fact be computed.

Let’s consider an arbitrary computable function f(a, b) — meaning a function that takes two inputs and returns an output. From this we can construct a new function g(i) as follows: g(i) is 1 if f(i, i) = 0 and undefined otherwise. Since f(a, b) is computable (by assumption), so is g(i) (this is a bit of hand waving that we will come back to). Computable means that there is a Turing Machine program that computes g. Given the enumerability of programs, the program that computes g has a number, which we will call e.

Now let’s throw the value e into f and compare that to our posited halting function h. It must be the case that either f(e, e) is 0 or it is not. If f(e, e) is 0 then g(e) is 1 by construction. But since program e then halts on input e, it must be that h(e, e) is 1 (which is different from f(e, e)). Conversely, if f(e, e) is not 0, then g(e) is undefined, again by construction. Undefined here is implemented by an indefinitely looping program, which means that h(e, e) = 0 (again different from f(e, e)). Recall that f was an arbitrary computable function and yet h, evaluated at (e, e), differs from it. Since this works for every computable f, h itself cannot be computable.
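The same diagonal move is often shown in a more compact self-referential form. This is the standard simplified rendering, not literally the f/h construction above, and halts() is of course the hypothetical decider:

```python
def halts(program, input_data):
    """Hypothetical: True if program(input_data) eventually stops."""
    ...

def paradox(program):
    """Do the opposite of whatever halts() predicts about program run on itself."""
    if halts(program, program):
        while True:     # loop forever if halts() says it would stop
            pass
    return 0            # stop immediately if halts() says it would loop

# Now feed paradox to itself. If halts(paradox, paradox) is True, then paradox(paradox)
# loops forever; if it is False, paradox(paradox) halts. Either way halts() gave the
# wrong answer, so no such halts() can exist.
```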

This is reminiscent of Cantor’s diagonalization proof for why there are more real numbers than integers. The diagonalization here becomes possible because we have a model of computation that lets us assign a number to each program. And because we can feed that number back to a program (including to itself) we have a way of formalizing what it means for one program to operate on another. This is the true power of Turing’s formulation — it connects programming deeply and elegantly to mathematics.  We can now use one to analyze the other and in the process see the limitations of both. Next Tuesday we will investigate some of the philosophical and practical implications of the non-computability of the halting problem.

Posted: 26th November 2013
Tags:  tech tuesday halting problem turing
