next up previous
Next: About this document ...

CSCI.1200 - Computer Science II
Summer, 2002 Worksheet 14
Algorithms, Algorithm Analysis

Alert: Some of the quizzes associated with this worksheet may be closed notes

This worksheet will introduce the concept of an algorithm and will provide some simple ways to measure the performance of an algorithm.

Definition of an algorithm

An algorithm is a procedure or a set of steps for solving a problem. Just as we distinguish between an abstract data type like a queue or list and the implementation of an ADT, we can distinguish between an algorithm and its implementation in a particular computer language.

There are often several different algorithms to solve the same problem. An important feature of the discipline of computer science is developing new algorithms to solve problems, and analyzing them to determine their performance. Performance usually means running time, i.e. how long does the program have to run to solve a problem of a particular size. (In some instances, it can also refer to other factors such as how much memory is required to solve a problem of a particular size.) Algorithms are closely linked to data structures, the way that data are stored. If data are stored in a particular way in memory, it is often easier to write an efficient (i.e. fast) algorithm.

Consider a common problem: We have a set of elements, and we want to find a particular element in this set. Suppose the elements are students.

Initially let's store them in an array in random order. The files student.h, setofstudents.h and main.cpp in the directory /dept/cs/cs2/wksht14/ define a simple class student, and a class SetOfStudents which stores some students in an array of students called data. The number of students currently in the set is n. Each student has a unique student number (a short int), and this is what we use to find a particular student. Such a unique identifier is called a key.

Here is the find() member function in the class SetOfStudents. It takes one argument, the idnumber of the student to search for. The algorithm is linear search; to find out if that student is in the set, it starts at the beginning of the array and moves along the array. It continues searching until it either finds the target or it reaches the end of the array.

  
student SetOfStudents::find(short targetid) {  
        int i=0;
        while (i < n) {
             if (targetid == data[i].getnum())
                 return data[i];
             else i++;
        }
        // if we get here, the student was not in the database
        // so we return an empty student.
        student nullstudent("",0);
        return nullstudent;
}
Exercise 0: If there are n elements in the set, what is average number of comparisons that the find algorithm must make to locate a element, assuming that the element that it is looking for is in the set.

Answer $n/2$

For small sets, this is not a problem, but if there are a large number of elements in the set, searching could be a time consuming process.

In situations where we want to do a large number of searches of a large set, a better method is to use an algorithm called binary search. Binary search only works in cases where the elements are sorted by the key. If the elements are sorted, you can go to the middle element in the set and compare its value to the target. If the middle element has a value greater than the target, you know that the element that you are searching for is below the middle. Thus, you have eliminated half of the search space. For example, if there were 20,000 elements in the set, and these are sorted by student number, you first compare the target to the element at cell 10,000. If this element has a student number greater than that of the target, you know that the target is in the range 0 ... 9999. You then compare the target to the element at cell 5,000. If this is less than the target, you know that the target is in the range 5001 ... 9999, so you have eliminated 5000 more elements. Each comparison reduces the size of the search space by one half.

Algorithms are generally described in structured English or pseudocode rather than in a specific programming language. This helps to distinguish an algorithm from its implementation. Here is the algorithm to search for TARGET in an array, and return either the subscript to it if it is in the array, or FALSE if it is not in the array

1: Set LO to the first subscript in the set
2: Set HI to the subscript of the highest element in the set
3: Set MID to (HI - LO)/2 + LO
This is the mid point of the range between LO and HI
4: If the element at MID is the TARGET, return MID
we have found what we are looking for
5: If the element at MID is greater than the TARGET
SET HI to MID - 1
else
Set LO to MID + 1
6: If LO is greater than HI return FALSE
an element with key TARGET is not in the set
7: GO TO 3

Here is an example. Suppose we are looking for the student with number 38 and there are 16 students in the set, sorted in ascending order by number. If the set looks like this:

10 11 18 20 22 29 30 31 34 38 40 44 53 58 61 70

LO is initialized to 0 and HI is initialized to 15, MID is initially 8 Since the element in cell 8 (31) is less than TARGET (38), LO is set to 9. MID is recalculated to be 12. Since the element at cell 12 (53) is greater than TARGET, HI is set to 11 and MID is recalculated to be 10. This is the target, so we have found what we are looking for with only three comparisons.

Exercise 1: Copy the files student.h, SetofStudents.h, and main.cpp from
/dept/cs/cs2/wksht14/ to your local directory, and modify the find function in the class SetOfStudents so that it performs binary search. Note that you will have to modify the data so that the students are inserted into the set in ascending order by idnumber.

Exercise 2:
a. Since binary search reduces the size of the search space by half each time that a comparison is made, what is the maximum number of comparisons that will need to be made if the size of the set is:

16
32
100
500
1,000
10,000

b. Write an algebraic (not a computer) function which computes the maximum number of comparisons required for binary search as a function of n, the number of elements in the set. The solution is at the end of this worksheet.

Now we have two different algorithms for our find function, linear search and binary search. The run time of linear search is some constant $C$ times $n/2$ (if the element is always in the set), while the run time of binary search is $C \times \lceil log_2(n)\rceil$ 1

It is probably the case that the constant $C$ for binary search is quite a bit larger than is the constant $C$ for linear search. This means that for small sets, linear search might actually be faster than binary search, but for large sets, binary search is much faster. Even if the constant for binary search is ten times as large as it is for binary search, the run time to search a set of 10,000 elements is 50 times faster for binary search than for linear search.

If the search key is unique (i.e. no two elements will have the same key), and the range of keys is known and is not too large, there is an even better way to perform searches. Create an array of the size of the total number of values for the search key, and the insert function will simply copy the record into the cell corresponding to that key. The student with the student number 1234 would be copied to cell 1234 in the array.

Exercise 3: Rewrite the code for the class SetOfStudents to incorporate this change.

Exercise 3a: Write an algebraic function (not a C++ function) of search time as a function of the size of the set for this implementation. The solution is at the end of this worksheet.

Let's review what we have done. We have implemented three different algorithms for find, looking for an element in a set. We have done some simple analysis of the run time of these algorithms.

Exercise 4: We have seen three different methods of performing the find algorithm. Describe them, show an algebraic function describing the running time of each as a function of the size of the set, and list any constraints or disadvantages.

Algorithm Efficiency

It is possible to write an algebraic function which describes the run time of a program (or a program component) as a function of the size of the input.

Exercise 5: For each of the following functions, write a precise algebraic function to compute how often the function dostuff would be called as a function of n.


void fctnA(int n)
{
    int i,j;
    for (i = 0; i < n; i++)
        for (j=0;j<n;j++)
            dostuff();
}

void fctnB(int n)
{
    int i;
    for (i=n;i<n+10;i++)
        dostuff();
}

void fctnC(int n)
{
    int i;
    for (i=0;i<n; i += 3)
        dostuff();
}

void fctnD(int n)
{
    int i,j;
    for (i=0;i<n;i++)
        for (j=i;j<n;j++)
            dostuff();
}

void fctnE(int n)
{
    int i;
    for (i=1;i<n; i *= 2)
        dostuff();
}

void fctnF(int n)
{
    int i,j;
    for (i=0;i<n;i++)
        for (j=1;j<n; j *=2)
            dostuff();
}

The solutions are at the end of this worksheet.

If the function dostuff runs in constant time C, then the answers to the above exercises describe the running time of the function fctn as a function of n.

The Order of Run Time Functions and Big O Notation

Computer scientists seldom concern themselves with the exact expression to determine the run time of an algorithm. They usually specify only the order of the expression.

If $f(n)$ and $g(n)$ are functions defined for positive integers, then to write
$f(n)$ is $O(g(n))$
(referred to as order $g(n)$ or big Oh of $g(n)$ means that there exists a constant c such that
$\vert f(n)\vert \leq c\vert g(n)\vert $
for all sufficiently large positive integers n.

Here is a contrived function where the running time is $n^2+5n+4$

void fctn(n)
{
    for (int i=0;i<n;i++)
        for (int j=0;j<n;j++)
            dostuff();
    for (i=0;i<n;i++)
        for (j=0;j<5;j++)
            dostuff();
    for (i=0;i<4;i++)
            dostuff();
}

Thus, if $f(n)$ is $n^2+5n+4$, the value of this function is less than $2n^2$ for all n greater than 5, so $c$ is 2, and $g(n)$ is $n^2$. In other words, this algorithm is order $n^2$ or Big Oh of $n^2$.

The order of any polynomial is the largest term with a non-zero coefficient. The expression $3n^5 - 5n^3 + 17n^2 + 12n$ is order $n^5$.

Most of the algorithms that we discuss in this course will be of one of the following five orders.

There are several other orders that deserve mention, although we will probably not discuss any such algorithms in this course.

Why is algorithm performance so important? After all, computer hardware is getting significantly faster every year, so it might seem that any kind of problem we want to solve will eventually be solvable quickly. This is an illusion, however, for two reasons. First, as faster computers become available, we usually ``raise the ante'' by applying them to larger and larger problems. We never have ``enough'' computational power. Second, for many kinds of problems, making a good algorithm choice is far more important than the speed of the hardware--when applied to large problems (one million data points, for example), a good algorithm running on a desktop PC can beat a poor one running on the fastest supercomputer.

For example, suppose one student implements linear search and the run time of the function is $10 \times (n/2)$, and another student implements binary search, where the run time is $100 \times log(n)$ Here is the run time for input of various sizes.

  linear binary
input search search
size (n) time time
10 50 400
50 250 600
100 500 700
500 2,500 900
1,000 5,000 1,000
5,000 25,000 1,300
10,000 50,000 1,400
50,000 250,000 1,600

For small inputs, linear search is often faster, but for very large inputs, it is obvious that binary search is very much faster.

Timing Runs

As you advance in computer science you often want to be able to time a number of runs of a program to see the run time with different sized inputs. There is function int clock() which returns an integer value representing time in thousandths of a second. Call this function at the start of the event that you want to time, storing the return value, execute the event to be timed, and then call clock() again. To get the time in seconds, divide the difference between the endtime and the starttime by CLOCKS_PER_SEC. This is a constant defined in the header file time.h to be 1000. Here is the code to do this.

// timetest.cpp
// a program to test how to time a function
#include <iostream.h>
#include <time.h>
int main()
{
    int starttime, endtime;
    starttime = clock();  //should be zero but don't assume it
        // event to be timed here
    endtime = clock();
    double elapsed = (double)(endtime - starttime) / CLOCKS_PER_SEC;
    cout << "elapsed time is " << elapsed << " seconds" << endl;
    return 0;
}

Exercise 6 Time how long it takes to go through an empty loop 10,000,000 times, 20,000,000 times and 30,000,000 times. Do the numbers look reasonable?

Solution to Exercise 2

16      4  
32      5
100     7
500     9
1,000  10
10,000 13
This function is $\lceil log_2(n) \rceil$. The symbols $\lceil$ and $\rceil$ refer to the ceiling function, the smallest integer greater than or equal to the enclosed real expression.

Solution to Exercise 3a

This algorithm is constant time. The time to find a student is independent of the number of students in the set.

Solution to Exercise 5

fctnA $n^2$ fctnB $10$ fctnC $\lceil n/3 \rceil$
fctnD $(n(n+1))/2$ fctnE $\lceil log_2(n) \rceil$ fctnF $n \lceil log_2(n) \rceil$



next up previous
Next: About this document ...
Paul Lalli 2002-05-17