next up previous
Next: About this document ...

CSCI-1200 Computer Science II
Summer, 2002
Worksheet 23

Sorting

One of the classic activities that programmers must solve is sorting, i.e., reordering elements of a collection in either ascending or descending order according to one of the fields of the elements, the key. A detailed study of sorting is important for a number of reasons. A great deal of computer time in industrial and commercial settings is spent doing sorting, and so it is important to do it efficiently. The sorting paradigm is also interesting for studying algorithms. There are more than 20 different sorting algorithms, and they often differ dramatically on their performance depending on the nature of the data.

A distinction is often made between internal sorting, in which all of the data are stored in the program, and external sorting, in which much of the data are in files. Most sorting algorithms deal with internal sorts.

The sorting algorithm which is probably the easiest to understand is insertion sort. To sort a list of $n$ elements, iterate through a loop $n-1$ times. On each iteration i, move element i to the left until it is in its correct place. Thus, after i iterations, all of the elements from 1 to i are correctly sorted relative to each other.

An insertion sort requires two nested loops. The outer loop increments the loop iterator i from 1 to size-1. On each pass through this loop the ith element will be put into its correct place in the list of all elements less than i. The element at position i is copied to a temporary variable temp. Then the inner loop iterator j is set to i-1. On each pass through the inner loop, j is decremented by 1, and the element at position j is moved down by 1. This process continues until either j is zero or the element at position j-1 is less than temp. At this time, the value of the element at position j is set to temp.

Here is the code for insertion sort.


template <class T>
void insertionsort(T data[], int size)
{
     T temp;
     int i,j;
     for (i=1;i<size;i++) {           // outer loop
          if (data[i] < data[i-1]) {
              temp = data[i];
              for (j=i-1;j>=0;j--) {  // inner loop
                  // C
                  data[j+1]=data[j];
                  if (j==0 || data[j-1] < temp)
                     break;
              }
              data[j]=temp;
          }
          // A
     }
}

If the initial array had these values:
34 31 12 23 47 13 28 10
and if we add a statement that printed out the array after each pass through the outer loop (at point A), the output would look like this:


31 34 12 23 47 13 28 10 
12 31 34 23 47 13 28 10 
12 23 31 34 47 13 28 10 
12 23 31 34 47 13 28 10 
12 13 23 31 34 47 28 10 
12 13 23 28 31 34 47 10 
10 12 13 23 28 31 34 47

Notice that after each iteration all of the elements less than i are sorted.

Exercise 1: To analyze the performance of insertion sort, we need to know how often the statements in the inner loop are executed. Take this code and add a main() which takes one argument, the number of elements to be sorted, n. Using the random number generator function random() in <stdlib.h>, generate n numbers, store them in an array, and sort them. Create a global integer variable count which will keep track of the number of times that the program passes through the inner loop by adding a statement to increment count at point C in the code.

Run the program with an n of 250, 500, 1,000, 2,000, 4,000, and 8,000, displaying the value of count each time. Figure out an algebraic expression which describes how the run time of insertion sort grows as n increases. What is the order (big O) of this expression?

The analysis: Assume random order. Here is our array:

$\overbrace{\overbrace{x_0, x_1, x_2, \ldots, x_i}^{n/2}, \ldots, x_{n-1}}^{n}$

For each i, how far back do we need to go to insert $x_i$? On the average, $\frac{i}{2}$. One comparison and one assignment are done for each. Since we go from $i=1$ to n, on average $i$ has the final value $\frac{n}{2}$. Thus, the formula for the run time of insertion sort is:

$\frac{1}{2} n \times \frac{1}{2} n = \frac{1}{4} n^2$

This is $O(n^2)$.

Selection Sort

The disadvantage of insertion sort is that it requires a lot of moving. This is OK if the size of the data elements is small, but very costly if they are large. Suppose that instead of integers, we were sorting student records, and each student record had several hundred bytes of data. Insertion sort would do a lot of time-consuming copying of this data. Another simple sorting algorithm, selection sort, gets around this by immediately moving the record into its correct spot.

As with insertion sort, we have two loops. The outer loop has a loop iterator i which runs from size-1 to zero. On each pass through the loop, the highest element among the remaining elements (those below i) is found, and is swapped with element i. Thus the inner loop runs from i down to zero and locates the highest element in that range.

Here is the code for selection sort. It uses a utility function swap which swaps the values of its two arguments.


template <class T>
void selectionsort(T data[], int size)
{
    void swap(T &, T &);
    int i,j, maxpos;
    T max;
    for (i=size-1;i>0;i--) {  // outer loop
        maxpos=i;
        max = data[i];
        for (j=i-1;j>=0;j--) {  // inner loop, look for largest remaining value
            // C
            if (data[j] > max) {
               max = data[j];
               maxpos = j;
            }
        }
        swap(data[i], data[maxpos]);
        // A
    }
}

If the initial data looked like this: 34 31 12 23 47 13 28 10
Here is what the list would look like after each pass through the outer loop (at point A):


34 31 12 23 10 13 28 47 
28 31 12 23 10 13 34 47 
28 13 12 23 10 31 34 47 
10 13 12 23 28 31 34 47 
10 13 12 23 28 31 34 47 
10 12 13 23 28 31 34 47 
10 12 13 23 28 31 34 47

Recall that i runs from size down to 1. Notice that each pass finds the highest element in the range i .. 0 and swaps this value with element i.

Exercise 2: To analyze the performance of selection sort, we need to know how often the statements in the inner loop are executed. As you did for Exercise 1, generate n random numbers and sort them. Add a global counter to keep track of the number of times the program passes through the inner loop. Add a statement at point C to increment the counter.

Run the program with an n of 100, 250, 500, 1,000, 2,000, 4,000, and 8,000, displaying the value of count each time. Figure out an algebraic expression which describes how the run time of selection sort grows as n increases. What is the order (big O) of this expression?

Analysis of Selection Sort: We know exactly how many passes it will take. We pass through the outer loop n times and through the inner loop i times, with i going from n to 1. So the number of passes through the inner loop is roughly $\frac{1}{2}n^2$. This is $O(n^2)$.

The analysis of these two sorts assumes that the data are random. Our experiment above used random numbers. Random numbers are of interest to academics, but in the real world, data to be sorted are seldom in truly random order. In fact, they are often nearly sorted.

Exercise 3: How would the performance of insertion sort and selection sort differ if the data to be sorted were: (a) already completely sorted   (b) sorted in descending order

Divide and Conquer Sorting

The divide and conquer strategy is to take a problem and split it into smaller but similar problems. With sorting, this typically means that we divide the problem of sorting into two smaller problems of sorting. To do this, we divide the array to be sorted into two smaller arrays, sort those two arrays, and then combine the two sorted arrays together to form one sorted array. We recursively divide the array until we get an array with one element, since an array with only one element is already sorted. An algorithmic description of this strategy is as follows:


void Sort(array, num_elements)
{
    if (num_elements is greater than 1) {
        Partition the array into array1 with num_array1 elements and
                                 array2 with num_array2 elements
        Sort(array1, num_array1);
        Sort(array2, num_array2);
        Combine(array1, num_array1, array2, num_array2);
    }
}

Quicksort

For many purposes, quicksort is the fastest sort, and as such it is widely used. In quicksort, all of the work is done in the partition phase of the algorithm, and no combine phase is needed. To partition the array, a pivot element is chosen from the array. Our hope is that about half of the elements in the array will be less than this pivot and that about half of the elements in the array will be greater than this pivot. We then partition the array so that elements less or equal to the pivot are in one subarray and elements greater than or equal to the pivot are in the other subarray. In other words, the array after partitioning will contain elements that are less than or equal to the pivot element followed by elements that are greater than or equal to the pivot element. Therefore, the problem has been divided into two smaller problems - sorting the array of elements less than or equal to the pivot element, and sorting the array of elements greater than or equal to the pivot element. Under most circumstances, you can just choose the first element in the array as the pivot, and this example does this.

Here is the code for the main function of quicksort. It uses a utility function partition to split the array into the two subarrays. The value returned is the location of the pivot element (which was initially the first element in the data set (data[0]). After returning from the partition call, all elements in the array with values less than the pivot are in locations less than pivotloc, all elements in the array with values greater than the pivot are in locations greater than pivotloc, and the pivot element is in pivotloc.

For example, if the initial array is
34 12 41 56 88 23 47 61 19 70
after returning from partition, pivotloc would have the value 3, and the data would look like this
23 19 12 34 88 56 47 61 41 70
The pivot value 34 (initially data[0]) is in location 3, and all elements in the subarray below this location are less than the pivot and all elements in the subarray above this location are greater than the pivot.

Then each of the two subarrays is sorted using quicksort.

template <class T>
void quicksort(T data[], int size)
{
    if (size > 1) {
        int pivotloc = partition(data, size);
        quicksort(data, pivotloc);
        quicksort(data + pivotloc+1, size - pivotloc-1);
    }
}

Most of the work done by this algorithm is done in partitioning the array. To partition, we need two int iterators, lo and hi. The pivot element is the first element of the array, so the lo is initialized to 1. hi is set to one less than the size of the array. We increment lo until we find an element which is greater than the pivot, then we decrement hi until we find an element which is less than the pivot, and we swap these two elements. This continues until lo and hi meet. These can either meet at the first element in the high subarray or the last element in the low subarray. The last step is to swap the pivot with the highest element in the low subarray. The function returns this location. Here is the code for partition, using a utility function swap, which simply swaps the values of its two arguments.

template <class T>
int partition(T data[], int size)
{
    T pivot = data[0];
    int lo = 1;
    int hi = size;
    while (lo < hi) {
        hi--;
        while (data[lo] < pivot && lo < hi) 
            lo++;
        while (data[hi] > pivot && lo < hi) 
            hi--;
        if (lo < hi)
            swap(data[lo], data[hi]);
    }
    if (lo > 0) {
        if (data[lo] > pivot) lo--; 
        swap(data[lo],data[0]);
    }
    return(lo);
}

Let's step through this code with the following array.

34 12 41 56 88 23 47 61 19 70

size is 10, and the pivot is 34. lo is initially set to 1 and hi is initially set to 10. We start incrementing lo until we find a value greater than the pivot, so we stop at 41 (lo is 2). We then decrement hi until we find a value less than the pivot, so we stop at 19. We swap these two. Now data looks like this

34 12 19 56 88 23 47 61 41 70

We increment lo until we find the next value greater than the pivot (56), then we decrement hi until we find a value less than the pivot (23) and we swap these. Now our array looks like this

34 12 19 23 88 56 47 61 41 70

We continue until hi and lo meet. The variables hi and lo can either meet at the first element greater than the pivot (in which case we decrement lo) or the last element less than the pivot. The final step is to swap the value at the first location (the pivot) with the highest value in the low subarray (the value indexed by lo). Our array now looks like this.

23 12 19 34 88 56 47 61 41 70

We return the location of the pivot (3).

Let's do a complete example. Here is the quicksort code modified to display the data before and after partition.

template <class T>
void quicksort(T data[], int size)
{
    int i;
    if (size > 1) {
        cout << "Start of sort, pivot is " << data[0] << " data ";
        for(i=0;i<size;i++)
             cout << data[i] << ' ';
        cout << endl;

        int pivotloc = partition(data, size);

        cout << "After partition pivotloc is " << pivotloc
             << " data ";
        for (i=0;i<size;i++) 
            cout << data[i] << ' ';
        cout << endl;
        quicksort(data, pivotloc);
        quicksort(data + pivotloc+1, size - pivotloc-1);
    }
}
Here is the output if the initial array looks like this:
34 41 12 56 88 23 47 61 19 70 33 10 98 29 

Start of sort, pivot is 34 data 34 41 12 56 88 23 47 61 19 70 33 10 98 29 
After partition pivotloc is 6 data 19 29 12 10 33 23 34 61 47 70 88 56 98 41 
Start of sort, pivot is 19 data 19 29 12 10 33 23 
After partition pivotloc is 2 data 12 10 19 29 33 23 
Start of sort, pivot is 12 data 12 10 
After partition pivotloc is 1 data 10 12 
Start of sort, pivot is 29 data 29 33 23 
After partition pivotloc is 1 data 23 29 33 
Start of sort, pivot is 61 data 61 47 70 88 56 98 41 
After partition pivotloc is 3 data 56 47 41 61 88 98 70 
Start of sort, pivot is 56 data 56 47 41 
After partition pivotloc is 2 data 41 47 56 
Start of sort, pivot is 41 data 41 47 
After partition pivotloc is 0 data 41 47 
Start of sort, pivot is 88 data 88 98 70 
After partition pivotloc is 1 data 70 88 98

Exercise 4: For an array of $n$ elements, what is the order (big O) of the partition function?

Notice how important it is to pick a good pivot, so that the partitions are of nearly equal size. There are other strategies for selecting a pivot. In the median-of-three strategy, three elements of the array are selected, and the median of those three values is used as the pivot. This is more likely to yield a good pivot than simply choosing the first element. If you do this, make sure that you swap the pivot with the first element because the code provided above assumes that the pivot is in the first slot.

Exercise 5: To analyze the performance of quicksort, we need to know how often we compare array elements to the pivot element in the partition function. As before, create a main() that generates n random numbers and sorts them. Use a global integer variable count to keep track of the number of times that the program compares array elements to the pivot element in the partition function by adding statements at points C1 and C2 inside it. For the pivot selection strategy, use the first element of the array to be partitioned as the pivot.

Run the program with an n of 100, 250, 500, 1,000, 2,000, 4,000, and 8,000, displaying the value of count each time. What is the order (big O) of the quicksort function?

The analysis: Assume random order. The partition function should give us partitions of roughly equal size. So, each call to quicksort results in two recursive calls to quicksort with arrays of roughly equal size. It will take $O(log~n)$ recursive calls to quicksort to get an array of size 1, since the array size is halved in each call. In addition, partitioning takes $O(n)$ time, so the quicksort algorithm is $O(n~log~n)$. This can be seen as follows, for an array of 16 elements:


To quicksort 16 elements, we must
    do 2 quicksorts of 8 elements.
    To do 2 quicksorts of 8 elements, we must
        do 4 quicksorts of 4 elements.
        To do 4 quicksorts of 4 elements, we must
            do 8 quicksorts of 2 elements.
            To do 8 quicksorts of 2 elements, we must
                do 16 quicksorts of 1 element.
                Sorting 1 element is trivial!

When we have two partitions of n/2 elements, it takes the same time to partition those two partitions as it would to partition n elements. Partitioning n elements is $O(n)$, and a logarithmic number of recursive calls to quicksort are needed until we have an array with one element, making quicksort $O(n~log~n)$.

Merge Sort

In quicksort, all the work was in the partition phase, and there was no need for the combine phase. With merge sort, the partition phase is simple and most of the work is done in the combine phase.

All the partition phase of merge sort does is split the data into two partitions of roughly equal size, with no other requirements on this partitioning. The partitioning continues until we have an array of size 1, which is a sorted array. The combine phase takes two sorted arrays and creates a new sorted array from them containing all of their elements. Here is the code for the main merge sort function - it splits the array in half, recursively merge sorts each half, and then merges the two merge-sorted halves into one sorted array.


template <class T>
void mergesort(T data[], int size)
{
    void merge(T [], int);
    // A1
    if (size > 1)
    {
        int size1 = size / 2;
        mergesort(data, size1);
        mergesort(data + size1, size - size1);
        merge(data, size);
    }
    // A2
}

The merge function takes the data array and the size of that array. We know that the first size / 2 elements are in one sorted array and that the remaining elements are in the other sorted array. We will need a temporary array to merge these two sorted arrays into, since we cannot overwrite the array that we were passed. Once the arrays have been merged into this temporary array, the temporary array must be copied to the data array. It is because of this copying that is required for arrays that merge sort is used more often with linked lists. (With linked lists, all that must be done in merge is to rearrange the next pointers so that the resulting list will be traversed in sorted order - no copying of data is necessary!)

When merging, we take the next element from each of the two sorted arrays, decide which element is less, copy it to the temporary array, and advance to the next element in that array. This continues until the last element from either array is copied to the temporary array. Once we have finished one of the arrays, we need to copy any elements remaining in the ``left" side of the array (in case the ``right" side became empty, causing the while loop to complete). If there are elements remaining in the ``right" side of the array (meaning that the ``left" side became empty, causing the while loop to complete), we need not copy these elements to the temporary array because they are already in the correct location. Finally, we copy the elements from the temporary array to the data array, and free the temporary memory that we allocated.


template <class T>
void merge(T data[], int size)
{
    T* tmp_data = new T[size];
    int combine_index = 0;
    int left = 0;
    int end_left = size / 2;
    int right = end_left;


    while ((left < end_left) && (right < size))
        if (data[left] <= data[right])
            tmp_data[combine_index++] = data[left++];
        else
            tmp_data[combine_index++] = data[right++];

    // If right is empty, copy remaining elements from left into tmp_data
    for (; left < end_left; left++, combine_index++)
            tmp_data[combine_index] = data[left];

    // If left was empty first, we do not need to copy from right because
    // elements are already in their proper positions in array

    // Copy everything back to original array
    for (int pos = 0; pos < combine_index; pos++)
        data[pos] = tmp_data[pos];

    delete [] tmp_data;
}

Exercise 6: For an array of n elements, what is the order (big O) of the merge function (assume that the arrays being merged have random elements)?

If the initial array had these values: 34 31 12 23 47 13 28 10
and if you added statements that printed out the array contents at points A1 and A2 inside the mergesort function, the output would look like this:


At top: 34 31 12 23 47 13 28 10
   At top: 34 31 12 23
      At top: 34 31
         At top: 34
         At bottom: 34
         At top: 31
         At bottom: 31
      At bottom: 31 34
      At top: 12 23
         At top: 12
         At bottom: 12
         At top: 23
         At bottom: 23
      At bottom: 12 23
   At bottom: 12 23 31 34
   At top: 47 13 28 10
      At top: 47 13
         At top: 47
         At bottom: 47
         At top: 13
         At bottom: 13
      At bottom: 13 47
      At top: 28 10
         At top: 28
         At bottom: 28
         At top: 10
         At bottom: 10
      At bottom: 10 28
   At bottom: 10 13 28 47
At bottom: 10 12 13 23 28 31 34 47

Exercise 7: To analyze the performance of merge sort, we need to know how often we copy elements to the temporary array. As before, create a main() which generates n random numbers in an array and sorts them. Create a global integer variable count which counts the number of times the program copies elements to the temporary array by adding statements to increment the count each time such a copy occurs.

Run the program with an n of 100, 250, 500, 1,000, 2,000, 4,000, and 8,000, displaying the value of count each time. What is the order (big O) of the mergesort function?

The analysis: Assume random order. Each time mergesort is called, the array is split into two roughly equal parts. It will take $O(log~n)$ recursive calls to mergesort to get an array of size 1, since the array is roughly halved in each call. Merging two sorted arrays into one larger sorted array takes $O(n)$ time, where n is the size of the resulting merged array. Therefore, the merge sort algorithm is $O(n~log~n)$. This can be seen as follows, for an array of 16 elements:


To merge sort 16 elements, we must
    do 2 merge sorts of 8 elements.
    To do 2 merge sorts of 8 elements, we must
        do 4 merge sorts of 4 elements.
        To do 4 merge sorts of 4 elements, we must
            do 8 merge sorts of 2 elements.
            To do 8 merge sorts of 2 elements, we must
                do 16 merge sorts of 1 element.
                Sorting 1 element is trivial!

When we have two partitions of n/2 elements, it takes the same time to merge those two partitions as it would to merge four partitions of n/4 elements. Merging n elements is $O(n)$, and a logarithmic number of recursive calls to mergesort are needed until we have an array with one element, making merge sort $O(n~log~n)$.

A stable sort is defined as a sorting algorithm in which elements that compare as being equal appear in the same relative order in the sorted array as they were in the unsorted array. In other words, if two elements are equal and one is in the unsorted array at index 3 and the other is in the unsorted array at index 6, then the index of the first one will be less than the index of the second one in the sorted array. This may not be important for sorting integers, but may be important for sorting student records based on the last name of the student, for which there may be several students with the same last name. The merge sort code given in this worksheet does a stable sort.

Exercise 8: Our analyses of quicksort and mergesort assume that the data is randomly ordered. How would the performance of these sorts differ if the data were already sorted?

Alert: Many of the quizzes for this worksheet will be closed book, closed notes.




next up previous
Next: About this document ...
Paul Lalli 2002-05-17