Tech Tuesday: Data Structures (Arrays)

If you have been a faithful reader of Tech Tuesday, you may be asking yourself just how many more posts on Data Structures could possibly be coming up as this is now the sixth such post (including the introduction).  The answer is that after today there will be at least one more before moving on to a different subject for now.  But that’s not for a lack of material — there is always more to say about data structures.  And if nothing else one important takeaway from these posts should be that understanding the keywords of a programming language is only a tiny part of what it takes to be effective at programming.

The posts so far have dealt with data structures that let us combine different attributes into a single thing (structs, maps, JSON, objects).  But what if we have many of the same thing?  Keeping with our canonical example the point.  What data structure can we use to keep track of many points?  In the examples so far we used two separate variables to hold our point A and point B.  What if we had a 1,000 points that together make up an image of size 20 x 50?  Using 1,000 different variables would be incredibly cumbersome.

Instead, most programming languages (but not all) have built in support for arrays.  We already encountered a version of the array in the post on maps, which are also known as associative arrays.  But what we are interested in now are arrays that use integers as indices and can be multidimensional as in the following example, which makes a 20 x 50 image of random colors with each color represented by a value between 0 and 255:

int points[20][50]; /* this is our two dimensional array */
int x, y;

for (x = 0; x < 19; x++) {
    for (y = 0; y < 50; y++) {
        points[x][y] = rand() % 256; /* modulo to get values 0 to 255 */
    }
}

You can see this code running here. The notation point[x][y] lets us access the array element at the position x and y.  As a longstanding programming convention, the indices always start at 0.  Which means that for an image of size 20 x 50, we go from 0 to 19 for x and from 0 to 49 for y as you see in the code.  This convention has been the source of many a programming bug.  If we want the array element that represents the point at x = 2 and y = 3, we need to access point[1][2] instead of point[2][3].  This is an instance of what is known as the off-by-one bug.

Where then does this convention come from?  In some languages, such as the C code above, arrays with integer indices have super efficient access.  In those languages you declare the size of the array upfront along with the size of each element as we did with the line “int points[20][50];”.  A 20 by 50 array of bytes (instead of integers in the example) would take up only 1,000 bytes, or less than 1 kilobyte which is technically 1,024 or 2 ^ 10 bytes.  Using a byte we could represent up to 256 different colors in our point example.

Now assume the whole array starts at a say memory location 5000 (usually memory locations would be denoted in hexadecimal but I will use decimal to make this example easier to follow).  The program can then reserve memory locations 5000 to 5999 for the array which will consist of 50 rows (y) of 20 columns (x) each.  So point[1,2] can then be found at memory location 5041.  In general point[x,y] can be found at memory location 5000 +  y * 20 + x.  You can now see immediately why it makes sense for the array indices to be “zero based” (meaning starting at 0).   point[0,0] is located at the beginning of the reserved memory region as 5000 + 0 * 20 + 0 = 5000 and point[19,49] is located at 5000 + 49 * 20 + 19 = 5999 which is the end of the array.

That approach to laying out arrays in memory does, however, not work for dynamic languages.  If you don’t know in advance how large an array is going to be and if you don’t even know the size of each element (as that might change from one element to the next), you can not have a simple formula for locating an element in memory.  Instead, you need to have some intermediate lookup mechanism.  Some table that contains the known indices and points to memory locations.  That approach also handles another issue: so-called sparse arrays.  Imagine that instead of the image where every point has a color, we are dealing with a huge matrix of one million by one million elements but of which only about one hundred have values and all the others are 0.  Reserving vast amounts of memory would be terribly inefficient.

Next Tuesday, in what will be the last post on Data Structures (at least for now), we will take a look at another set of data structures that let us keep track of many things in the form of lists and other ways of linking elements to each other.

Posted: 9th October 2012Comments
Tags:  tech tuesday programming data structures arrays

Newer posts

Older posts

blog comments powered by Disqus
  1. continuations posted this

Newer posts

Older posts