Tech Tuesday: Data Structures (Lists, Trees, Graphs)
As previously promised, this will be the last Tech Tuesday post on Data Structures for a while and starting next week we will return to other aspects of programming. Much of what we have seen in the previous data structure posts are structures that are built into the respective programming languages, such as structs in C, associative arrays in PHP, JSON in Javascript, objects in PHP and multi-dimensional arrays in C. Today we will take a look at rolling our own data structures that let us set up such things as linked lists, binary trees, graphs and more.
We will stick with our example of points. Let’s imagine we want a structure that lets us put together a list of points that together make up a shape. For each point we want to be able to access the next point in the shape. Here is a bit of Python code that gives us a data structure to accomplish that
class Point:
def __init__(self, x, y):
self.x = x
self.y = y
self.next = None
def append(self, point):
self.next = point
And here is how we could use this kind of point in our subsequent code
a = Point(2, 3) b = Point(5, 7) a.append(b) print a.next.x c = Point(10, 15) b.append(c) print a.next.next.x
You can just fire up Python from your terminal on a Mac or see this little bit of code running here.
Why and how does this data structure work? Each point is an object that as one element has a field refers to the next point in the shape. We make the connection to the next point by calling the append method on an existing point with the next point as the argument. The critical piece is the assignment “self.next = point” inside the append method. That does not create a copy of the point but instead makes self.next refer to the same object as point does. You can verify that by trying out the following (assumes you have executed the previous code):
b.x = 6 print b.x print a.next.x
You will see that changing the x coordinate of point b also changes it for the next point referred to by point a because this is one and the same object.
As it turns out Python has a fair bit of support for so-called collections, which are container data structures that we can use instead of having to roll our own. So instead of the above code, we could have used a deque to hold our points. Built-in collections are likely to be faster than anything we come up with but there is not always a built-in collection that meets our needs. The method of storing references to objects shown above could for instance be used to build a tree of points or an arbitrarily interconnected graph of points.
In programming then a good bit of time should be invested in figuring out the right data structure to use. That could be a built-in one or one that we create ourselves to fit our specific problem. But data structures alone don’t make a good program. We also need ways to insert data, retrieve it, manipulate it, etc. For instance if we have a list of points, how do we figure out whether they form a closed shape? Coming up with a series of steps to solve such a problem is called and algorithm and that is what the upcoming posts will be about.
Posted: 16th October 2012 – Comments
Tags:
tech tuesday data structures lists trees graphs
Tech Tuesday: Data Structures (Arrays)
If you have been a faithful reader of Tech Tuesday, you may be asking yourself just how many more posts on Data Structures could possibly be coming up as this is now the sixth such post (including the introduction). The answer is that after today there will be at least one more before moving on to a different subject for now. But that’s not for a lack of material — there is always more to say about data structures. And if nothing else one important takeaway from these posts should be that understanding the keywords of a programming language is only a tiny part of what it takes to be effective at programming.
The posts so far have dealt with data structures that let us combine different attributes into a single thing (structs, maps, JSON, objects). But what if we have many of the same thing? Keeping with our canonical example the point. What data structure can we use to keep track of many points? In the examples so far we used two separate variables to hold our point A and point B. What if we had a 1,000 points that together make up an image of size 20 x 50? Using 1,000 different variables would be incredibly cumbersome.
Instead, most programming languages (but not all) have built in support for arrays. We already encountered a version of the array in the post on maps, which are also known as associative arrays. But what we are interested in now are arrays that use integers as indices and can be multidimensional as in the following example, which makes a 20 x 50 image of random colors with each color represented by a value between 0 and 255:
int points[20][50]; /* this is our two dimensional array */
int x, y;
for (x = 0; x < 19; x++) {
for (y = 0; y < 50; y++) {
points[x][y] = rand() % 256; /* modulo to get values 0 to 255 */
}
}
You can see this code running here. The notation point[x][y] lets us access the array element at the position x and y. As a longstanding programming convention, the indices always start at 0. Which means that for an image of size 20 x 50, we go from 0 to 19 for x and from 0 to 49 for y as you see in the code. This convention has been the source of many a programming bug. If we want the array element that represents the point at x = 2 and y = 3, we need to access point[1][2] instead of point[2][3]. This is an instance of what is known as the off-by-one bug.
Where then does this convention come from? In some languages, such as the C code above, arrays with integer indices have super efficient access. In those languages you declare the size of the array upfront along with the size of each element as we did with the line “int points[20][50];”. A 20 by 50 array of bytes (instead of integers in the example) would take up only 1,000 bytes, or less than 1 kilobyte which is technically 1,024 or 2 ^ 10 bytes. Using a byte we could represent up to 256 different colors in our point example.
Now assume the whole array starts at a say memory location 5000 (usually memory locations would be denoted in hexadecimal but I will use decimal to make this example easier to follow). The program can then reserve memory locations 5000 to 5999 for the array which will consist of 50 rows (y) of 20 columns (x) each. So point[1,2] can then be found at memory location 5041. In general point[x,y] can be found at memory location 5000 + y * 20 + x. You can now see immediately why it makes sense for the array indices to be “zero based” (meaning starting at 0). point[0,0] is located at the beginning of the reserved memory region as 5000 + 0 * 20 + 0 = 5000 and point[19,49] is located at 5000 + 49 * 20 + 19 = 5999 which is the end of the array.
That approach to laying out arrays in memory does, however, not work for dynamic languages. If you don’t know in advance how large an array is going to be and if you don’t even know the size of each element (as that might change from one element to the next), you can not have a simple formula for locating an element in memory. Instead, you need to have some intermediate lookup mechanism. Some table that contains the known indices and points to memory locations. That approach also handles another issue: so-called sparse arrays. Imagine that instead of the image where every point has a color, we are dealing with a huge matrix of one million by one million elements but of which only about one hundred have values and all the others are 0. Reserving vast amounts of memory would be terribly inefficient.
Next Tuesday, in what will be the last post on Data Structures (at least for now), we will take a look at another set of data structures that let us keep track of many things in the form of lists and other ways of linking elements to each other.
Posted: 9th October 2012 – Comments
Tags:
tech tuesday programming data structures arrays
Tech Tuesday: Data Structures (Objects)
This is the fifth post on Data Structures as part of my long Tech Tuesday series on programming. And today we finally make it to a structure known as an object. There is a whole literature on so-called object oriented programming, which sometimes seems like a modern idea because of its prominence in the 1990s but really goes much further back than that with programming languages such as Smalltalk in the 1970s and basic object concepts go back even earlier than that. While another approach known as functional programming has been all the rage recently, yesterday brought a big announcement from Microsoft about a language called TypeScript that brings strongly typed objects to the Javascript environment.
So what is that mythical object? At its most basic level it is simply an extension of the kind of data structures that we have been looking at. An object contains data attributes but also bundles in programming code that can operate on those attributes. We have been using a point as our example (which funnily enough is also one of the examples used on the TypeScript homepage). We have seen a variety of different ways so far of keeping the x and y coordinates of a point together, including structs, associative arrays and JSON. Now here is an example of a point as an object in PHP which you can see running here:
class Point {
var $x, $y;
function __construct($x, $y) {
$this->x = $x;
$this->y = $y;
}
function distance($p) {
return sqrt(($this->x - $p->x) * ($this->x - $p->x) + ($this->y - $p->y) * ($this->y - $p->y));
}
}
$a = new Point(3, 2);
$b = new Point(5, 7);
print $a->distance($b);
We start by declaring a class called Point. What’s a class you ask? It is the data type for our points. So far this is similar to the structs where used typedef in C to give ourselves a data type. Every point object that we later create in the code belongs to the class Point (which is another way of saying that the objects have the data type Point). Unlike the struct though, the class also contains some code. In the example above that consists of the oddly named function __construct and a function for calculating the distance from this point to another point.
What is the advantage of putting the distance code into the Point class as opposed to having a separate free floating distance function as we had for the structs? Imagine a future self (or some other engineer) making a change to how points are represented. Now you need to find all the functions that refer to the internal structure of points and adapt them accordingly. By putting these functions into the class, we know exactly where to find the code. That’s the purely pragmatic aspect.
But there is a deeper philosophical view that’s really important. By putting so-called methods into the class (that’s what we call the functions that are associated with an object data type) we are creating an abstraction. Both the internal representation of the object and the internal workings of the code are encapsulated inside the class and thus “abstracted away” from anybody simply using the object.
That is one of the fundamental ways to think about programming: the gradual building up of higher levels of abstraction. With every level of abstraction we gain more power. We could use the point objects to come up with line objects. Then line objects to create shape objects. And so on. The highest level abstraction is the program itself. That’s why people care so much about objects in programming. We will spend a lot more time on them in the future, including learning what the somewhat mysterious __construct function does, but before that we will continue with some more data structures.
Posted: 2nd October 2012 – Comments
Tags:
tech tuesday programming data structures objects
Tech Tuesday: Data Structures (JSON)
In today’s Tech Tuesday we are continuing our romp through Data Structures as part of the ongoing series on programming by looking at JSON. Now technically JSON, which stands for JavaScript Object Notation, is not a data structure but a textual notation (as its name implies). But JSON has become ubiquitous and provides a natural segue to objects (next Tuesday) that I figured it makes sense to discuss it here.
Sticking with the example of points, the JSON for a point might look like the following
{
"x" : 2,
"y" : 3,
"color" : {
"red" : 120,
"green" : 20,
"blue" : 200,
}
}
In this example instead of a single string for a color (“blue”) as before, I am showing how JSON lets us describe levels of nesting. The red, green and blue are sub-attributes of the color attribute.
How would we actually use this JSON? In modern implementations of JavaScript, we can make a direct assignment of JSON to a variable as follows:
var point = { <JSON from above> };
print(point.x);
print(point.color.green);
You can see this code running here. The variable point now holds the attributes that we described in easily human readable form in JSON. We access these attributes in the code using the dot notation shown above.
This simple example raises a bunch of questions. First, how does JavaScript implement this? JavaScript parses the text. We will learn what parsing is in more detail in the future but for now suffice it to say that it essentially means turning the text into a data structure. Which data structure? Well depending on how you look at it either as an associative array or an object. Huh? Yes, under the hood it is an associative array for sure and JavaScript even supports notation that makes this clear: instead of point.x we can write point[“x”]. But most JavaScript code that you will see uses the point.x notation instead which represent viewing point as an object. What makes an object special? That will be the subject of next Tuesday’s post.
Before that though a few more words about JSON. Because it is a textual notation, it is ideally suited for passing structured data between different programs independent of programming language. That is why many modern APIs (Application Programming Interface) consume and return JSON. Most languages require an explicit invocation of a parser to turn the JSON into a native data structure for that language which could be an associative array or something else (e.g., an object — in most languages there is a stronger distinction between the two than in JavaScript).
Posted: 25th September 2012 – Comments
Tags:
tech tuesday programming data structures JSON
Tech Tuesday: Data Structures (Maps and Hashes and Dictionaries, oh my!)
Last Tech Tuesday we started digging into Data Structures by examining records and structs. We saw that these allow us to group attributes belonging to something like a person or a product or a geometric point but generally don’t keep the metadata around when the program is executing. Metadata means data about data and here would be the names of the attributes (and potentially their respective data types). All we have access to are the attribute values.
There are other data structures that provide us more information at runtime. There is a group of these which we will look at today that goes by different names depending on the programming language. You may encounter them as a map, a dictionary, a hash, an associative array, and probably a few more. Let’s start with a simple piece of PHP code that illustrates the idea by reprising our example from last time:
$a = array("x" => 3, "y" => 2);
print("Point a has coordinates x = " . $a["x"] . " and y = " . $a["y"]);
$b = $a;
$b["y"] = 7;
print("Point b has coordinates x = " . $b["x"] . " and y = " . $b["y"]);
We can see an immediate big differences here from last week’s C code. We are not actually declaring a data type at all. We are simply assigning something directly to variable a. That something is an associative array which is created by a call to the function array, which takes as its inputs so called key-value pairs. The keys here are “x” and “y” and the values are 3 and 2 respectively. We can access the value stored for a given key by using the notation $variable[“keygoeshere”].
Now how would we go about defining a distance function? Here is the PHP code
function distance ($a, $b) {
return sqrt(($a["x"]-$b["x"])*($a["x"]-$b["x"]) + ($a["y"]-$b["y"])*($a["y"]-$b["y"]));
}
which you can see in action in this simple example. Since there is no point data type here the two arguments to the function are simply variable names. That means that static inspection of the program code alone wouldn’t be able to catch a bug such as making the following call distance(“I am not a point”, “neither am I”) which of course would not work.
So far it sounds like we have only lost things (data type checking of the code) and have wound up with slightly uglier notation. Why would we ever want to do this? Here is the first reason. Let’s say we later decide that we want to keep track of the color of points, we can simply write the following:
$a["color"] = "blue";
Our point a now has a new attribute called “color” with the value “blue.” This extra attribute only exists for point a and point b does not (yet) have it. We were able to do this without changing any of the existing code and everything will execute just fine. In the C example we would have had to change our definition of the struct and every new point would make room for a color attribute whether or not we use it.
A second advantage of this type of data structure is that we can examine which attributes are present at runtime. For instance, if our code dealt with both 2D and 3D points, where 3D points have a “z” attribute, we can determine if a particular point has the “z” attribute as follows
if (array_key_exists("z", $a)) print ("Point is 3D");
Of course we might have by accident stored a string in the z attribute and the code above wouldn’t detect that but we could add a check to make sure it contains say an integer. One place we might want to use the ability to examine points is in our distance calculation. If we have two 3D points we would use a different formula than for two 2D points.
Associative arrays, maps, dictionaries, whatever they may be called in a particular language have many uses other than keeping attributes for things together which we will maybe get around to examining in the future. Next Tuesday we will look at a particular way of expressing associative arrays called JSON which is short for JavaScript Object Notation. As the name suggests this originated with JavaScript but is now used much more widely.
Posted: 18th September 2012 – Comments
Tags:
tech tuesday data structures
Tech Tuesday: Data Structures (Overview)
Having taken some time off, I am excited to start Tech Tuesdays back up. We will continue looking at programming and the next topic are more complex data types which are also referred to as data structures. If you have been following the programming series, you may recall that we already had three separate posts dealing with data types. These introduced the concept of data types and discussed why they matter and how they are implemented. But those posts dealt with simple types of data such as a single number or character. How do we go about representing something more complicated like a chess board or a person inside of a computer program?
Let’s take the example of a person. We might want to keep track of the person’s name, age, and gender. Now with what we already know, we could write code as follows:
var personName = "Albert Wenger"; var personAge = 45; var personGender = "male";
We simply create three separate variables, each using a basic data type, to keep track of the three pieces of data. But this seems unsatisfying as we don’t have a way of referring to the person “as a whole” — so we can’t write something like
winner = fight(cook, thief);
where “fight” is a function that takes as its arguments two persons and returns a person as its result.
What we need instead is a way of bringing the individual attributes together into a single unit that we can then use throughout our program. Different programming languages offer different ways of accomplishing this goal. These go by names such as structs, records and objects. We will have one or more posts on that in what will be a mini-series on data structure.
There is another expressive problem though that we need a way of addressing. Let’s say we want to keep track of a group of persons, such as a class of students. If there are 25 students in the class then it would be very cumbersome to have to say something like
var student1 = ...; var student2 = ...;var student3 = ...; ...
So even once we can refer to a person as whole, we still don’t have a way to refer to a collection of persons. The same problem exists even for simple data types — right now we don’t have a way of referring to a collection of numbers. Again programming languages offer data structures to help with this. These go by names such as arrays, lists, and sets and will be the subject of another group of posts in this data structure mini series.
We could start by looking either at the first problem — grouping multiple attributes into a single whole — or the second problem — collections of multiple things. I haven’t decided yet which makes more sense, so if anyone reading this has a suggestion as to where to start I welcome it.
Posted: 4th September 2012 – Comments
Tags:
tech tuesday programming data structures



