Last Tech Tuesday was now two weeks ago due to Hurricane Sandy. As a quick refresher, we looked at where the word algorithm comes from, saw that an algorithm is similar to a recipe and gave an example of a simple algorithm for finding a repeated number in a sequence of numbers. Today we will take a deeper dive on this particular algorithm and see at least one specific implementation. All of this is still part of a by now fairly long running series of posts on programming.
So revisiting the algorithm from last time it was supposed to detect whether a sequence of integers contains the same number at least four times in a row. I presented a candidate algorithm and then pointed out that it still had a mistake. Several people sent me email correctly identifying that the algorithm didn’t actually stop once it found four numbers in a row. Here then is the corrected version:
Put the numbers in a data structure Set counter to 1 Get the first number (which will be our previous number) While there is a next number *and* the counter is less than 4 Get the next number If the next number is the same as the previous number increment the counter Else Set counter to 1 Use the next number as the new previous number If the counter is 4 Display that a four times repeated number was found Else Display that no four times repeated number was found
Before we go on and turn this into code in a bunch of different programming languages there is a lot that can be learned just from looking at this algorithm so far.
First, we are solving a very specific problem with this algorithm. We are simply determining whether or not a number is repeated four times. We are not even displaying which number it is (how would you do that?) or where in the sequence it was found. Nor would we discover if there was a second sequence (because we stop after the first one). And for the case of finding multiple repeated sequences what about overlapping ones? E.g. does 3111115 contain one sequence of four 1s followed by a single 1 or two overlapping sequences of four 1s as in “31111” and “11115”? So it is always extremely important to understand the details of the problem because they may heavily influence the algorithm that works.
Second, we are doing no testing of inputs or examination of edge cases here. If instead of numbers the input that was supplied was a sequence of words our algorithm would not detect that. Sometimes it is OK to assume that the inputs will take the form stated in the problem. In other domains input validation must be part of the algorithm for it to succeed. As far as edge cases go, what about a sequence of length zero (ie containing no numbers at all)? How would our algorithm perform there? As we will see for our currently specified version there may be a problem (can you spot it?).
Third, even though we stated at the beginning to put the numbers into a data structure, we never specified which one. Instead we simply assumed that whatever data structure it was, it has the capability of determining whether there is a next number and if so giving us that next number. In fact, this algorithm would even work with an “online” stream of numbers. When we say “online” in an algorithm we generally mean as opposed to “batch,” meaning where we have all the numbers at the start. If we were being fed the numbers one by one from the keyboard or some data source (such as a sensor at a production line) the algorithm above would work. This is an example of the power of abstraction. We don’t need to know the specifics of the dat structure just that it supports specific operations.
Fourth, we can see from this algorithm without needing to implement it in any particular language that it is quite efficient. As we will discuss at some point in the future what we are particularly interested is the time and space efficiency of algorithms. By time we mean how long it takes to rund and by space we mean how much storage it consumes. In the example above we are going over the sequence of numbers only once (and less than that if a repetition is found). We say that the algorithm has linear running time, meaning that the increase in time is linear to the increase in the size of the problem. It also takes up very little space because in addition to the sequence itself it only has two additional variables.
Fifth, how did we come up with this algorithm? It seemed rather ad hoc — we simply tried something. In general that’s not a good approach to finding algorithms. Usually we try to determine what class of problems our specific problem belongs to and then select from one of several known algorithms instead of coming up with our own. As we will see in a subsequent post, this problem can be seen to belong to a class known as “pattern matching.” We are looking for a pattern of four repeated numbers. There are a variety of different pattern matching algorithms and we can draw on them to solve the problem. Using a known algorithm means that maybe someone else has already implemented it (or possibly it is even part of the programming language) and is available as open source or as a library.
All of these five points are important to take into consideration whenever you think about algorithms and I am planning to explore them in more detail at some future point. In the meantime though, as long promised, here finally is the first implementation of the algorithm in Python:
# numbers = [1, 15, 7, 3, 8, 8, 19, 5, 12, 8, 2, 6, 8] numbers = [1, 15, 8, 8, 8, 8, 7, 3, 19, 5, 12, 2, 6] counter = 1 prev = numbers.pop() while len(numbers) > 0 and counter < 4: next = numbers.pop() if prev == next: counter = counter + 1 else: counter = 1 prev = next if counter == 4: print "Found four times repeated number" else: print "No four times repeated number found"
One of the reasons I picked Python first (and also one of the reasons why people like Python) is that the code looks very clean and in fact amazingly close to our “pseudo code” version of the algorithm. The only thing that takes maybe a bit of explanation is that in the while loop we test for len(numbers) which gives us the length of the list of numbers. If that’s greater than 0 than it means there are still numbers for us to look at.
Another nice thing about this Python implementation is that because of the way Python handles data types, this code actually works not just with numbers but also with say words (as you can see here). Some will consider this a feature and others a bug. Speaking of bugs, I should point out that there are two things wrong with this implementation. One is that it doesn’t handle the edge case of an empty list of numbers (how would you fix that?). The other is not technically a bug but it does something that’s quite different from how you may have envisioned this algorithm working (here is a hint). Consider both of these as tests. Next time we will look at implementations of this same algorithm in other languages.
This is a continuation of yesterday’s post. Those of us who have been around computers for a long time have lived through many letdowns when it comes to AI. The promise was frequently grand (cf various Herbert Simon quotes) but the actual results were disappointing to say the least. But a funny thing has happened over the last decade on the web. The grand promises have (mostly) disappeared and yet the results are beginning to show up! We might not be noticing them because they are not even labeled as AI, but I believe that they qualify.
A simple example is Netflix's recommendation system. Based on your ratings of movies, the sytem recommends other movies that you might like. It does so with fairly good accuracy already and several teams have been working on significant improvements. In my book this is a type of AI — the machine is doing something that if a human did we would consider to be intelligent. I belong to the “if it walks like a duck and talks like a duck, then I don't care if it's mechanical” school of AI.
One can easily see how the Netflix recommendation system could one day become a subsystem of the kind of AI that we have envisioned in the past. For instance, as a first step, if you were to combine it with all the movie knowledge contained in IMDB, then you would have something that could easily pass as an expert on movies. Yes, it still could not watch and understand a movie, but I don’t think that’s a requirement. Certainly plenty of humans hold opinions about movies that they have not seen solely on the basis of reading someone else’s review.
Why is this happening now? The answer is “It’s the data, stupid.” AI had been madly focused on algorithms, but algorithms without data are like fish without water — they don’t swim very far. What the web has done is put a wealth of data out there that algorithms can make use of. When there is a trade-off, it appears that more data beats better algorithms (of course better algorithms with more data will always be superior).
This is of course one of the all time great (geek) topics: who will rule in the future — humans or machines? I was reminded of it by discussion of Techmeme's story selection. It should not be surprising that a single algorithm — in Gabe’s case link analysis — is not always successful at picking out the important stories. One of the things that makes our human brains so powerful is that they can more or less in parallel apply many different algorithms and synthesize the results. When we look at a tech story, for instance, we can quickly relate it to past stories, we almost instantly infer its sentiment, we triangulate with our perception of the source, etc.
I have a few believes about where we are at now:
- First, to improve automated news systems we have to combine many different algorithms. Link analysis, semantic analysis, sentiment analysis, and so on. Doing so is not easy because it requires a probabilistic framework that can process the inputs from multiple algorithms.
- Second, even once we have done that, we are likely to still fall short of those human algorithms that are based on deep understanding. But we may do well enough to let individual readers take that last step rather than editors.
- Third, in order for the second point to apply, I believe we have to change our attitudes toward automated systems a bit. This will happen naturally for the generation growing up with such systems but will require some work for us older folks. Instead of “dissing” a system when it makes a “silly” mistake (which is really a fairly lame way of asserting our human superiority), we should learn to simply ignore those.
I personally do not see an ultimate barrier to deep understanding becoming part of what systems can do, but I believe it will still take a long time. Until then we can asymptotically approach what humans can do, but probably with significantly higher outcome variance.