Intro to Computer Science

Unit 4 - Sorting and Searching Algorithms

Table of Contents

  1. Intro to Algorithms; Selection Sort
  2. BubbleSort; Big-O notation
  3. Linear Search
  4. Binary Search
  5. Nested Loops
  6. Debugging
  7. Content Review

4.0. Introduction to Algorithms; Selection Sort

4.0.0. Overview

We've spent the first weeks of this class learning a little about Python's basic data types (int, float, str), Python's decision-making control structures (if, if-else, if-elif-else), Python's looping control structures (while, for), and the list data type.

Now let's learn how to use those tools to actually build some working algorithms.

4.1. Computational Thinking, Problem Solving, and Algorithms

Computational Thinking

Computational thinking is a problem-solving approach that is useful in learning how to program a computer, but also has applications in many other problem-solving contexts. Computational thinking includes:

We've been using the principles of computational thinking almost every day in this course as we learn how to think about problems in a way that will help us better understand how to write code that will solve the problem.

Definition: Algorithm

An algorithm is a step-by-step strategy for solving a problem, usually discovered or developed by breaking a problem down into individual components. By understanding the pattern of problem presented, we can develop an abstract strategy for solving that type of problem.

Think of an algorithm as a "recipe," with instructions on how to do something.

What algorithms have we developed in here? The Sieve of Eratosthenes is one example. Based on the idea that prime numbers are numbers that don't have any multiples in them, we created a list of numbers, and starting at number 2 (the lowest prime), we identified that as a prime, and took out all the other numbers on the list that were multiples of that. Then we moved up to number 3, identified that as a prime, and took out the multiples of that number as well, and so on. That strategy, that recipe, that algorithm is one way of creating a list of prime numbers.

An algorithm may be expressed as a series of instructions, as above. It might be a numbered list of items in a procedure. It might be encoded as a flowchart, or written as pseudocode. For an algorithm to be expressed as code, of course, it eventually has to get written in a form that the computer will understand.

Show/hide Sieve of Eratosthenes

#!/usr/bin/env python3 """sieve_of_eratosthenes.py Prime finder""" maxValue = int(input("Maximum prime you want to find? ")) possiblePrimes = [] # Create list of numbers from 2 to maximum value for i in range(2, maxValue + 1): possiblePrimes.append(i) # Now find primes while len(possiblePrimes) > 0: prime = possiblePrimes[0] # first number on list is a prime print("Here's a prime: ",prime) possiblePrimes.pop(0) # remove number at first position for aValue in possiblePrimes: # go through remainder if aValue % prime == 0: # of list and remove possiblePrimes.remove(aValue) # multiple of prime

4.1.1. Sorting and Searching

A classic way to learn about algorithms is to examine how one performs two very important computer operations: sorting, and searching.

We'll begin with the Selection Sort.

4.1.2. The Selection Sort algorithm

Selection Sort

The Selection Sort algorithm works as follows:

  1. Begin with a list of unsorted items.
  2. Beginning with the first item on the list, go through and find the location of the smallest value in the list.
  3. After finding the location of the smallest value, exchange the value at that smallest location with the value at the start of the list. (Smallest value on list is now first in the list.)
  4. Repeat, starting at 2nd position: go through list, find location of smallest item on list, and place it in 2nd position.
  5. Go through entire list in this fashion until done.

Let's write a Selection Sort algorithm in Python.

selection_sort.py

Write a program that:

  1. creates a list called num_list of 20 items, each one a random integer between 0 and 1000, inclusive.
  2. applies a Select Sort to this list. Display the list onscreen as your program works, and included a pause statement that requires you to hit the <Enter> key to advance the program each step.
  3. prints out the final sorted list.

4.1.3. Refactoring

Once you've written the basic program, let's refactor it.

Refactoring

"Refactoring" a program is the process of rewriting it so that it works differently, often more efficiently, then the original version of the program.

You've probably written this program with a single main() function running the sorting process. Let's refactor your Selection Sort.

In this case, let's just split the program up into three logical pieces. We'll have:

  1. a main() function that establishes the initial values that our program will need: how many values we're going to be sorting, and what the range of those values are, for example.
  2. a create_random_values() function that is responsible for creating the list of random values, and
  3. a sort() function that actually sorts the values in our list.

Refactoring selection_sort.py

Rewrite your selection_sort program so that the the process of creating the list of numbers and the process of sorting a list of numbers each occur in separate functions.

 

#!/usr/bin/env python3 """ selection_sort.py Demonstrates the use of a Selection Sort algorithm. The Selection Sort algorithm begins at the first position and moves through the entire list, looking for the smallest value. Once it finds the location of the smallest value, it exchanges the value there with the value at the first position. It then proceeds to the second position and moves through the entire list, repeating the process. By the time it arrives at the last position the list is sorted. @author Richard White @version 2015-06-08 """ import random def create_random_values(the_range, the_amount): """ This function takes a given range and produces a given amount of random values in a list. """ rand_values = [] for i in range(the_amount): rand_values.append( random.randrange(the_range) ) return rand_values def sort(randnums): for start in range(len(randnums)): minimum_index = start for i in range(start, len(randnums)): if randnums[i] < randnums[minimum_index]: minimum_index = i # swap randnums[start], randnums[minimum_index] = \ randnums[minimum_index], randnums[start] def main(): print("Creating random values...") randnums = create_random_values(1000, 20) print(randnums) print("Sorting random values...") sort(randnums) print("Done") if __name__ == "__main__": main()

4.1.4. Timing your program

Once the basic program is working, proceed through the following modifications of the program so that we can examine it in more detail.

Selection Sort continued

  1. In the selection_sort.py program, comment out your program's pause statement so that it will run without interference.
  2. Modify the program so that the list contains 1000 items rather than just 20.
  3. A major source of slowdown for the program is having to print out the list during each iteration. Comment out the print statement in the loop, and print only the final sorted list.
  4. You can determine more precisely how long it took the program to sort the items by including a time analysis like this.
import time def main(): # create random list here, then... pause = input("Hit <Enter> to begin sorting") # Establish initial time that sorting begins timeInitial = time.time() # # # Do the sorting here, then... # # # Get time that operation finished timeFinal = time.time() print("The Selection Sort of",len(numList),"items took") print(timeFinal - timeInitial,"seconds.")

4.2. Bubblesort, and Big-O Notation

4.2.0. Overview

The SelectionSort is one way of working through a list of items and putting them in order. Are there any other ways? Other strategies that we could use? Other algorithms that we could program?

And are some of the strategies faster than other, or more efficient? How would we know? You're running a Mac, I'm running a PC, that guy is running Windows 10, that girl is running Linux... We can't just time how fast the programs work, can we? That would measure the processing speed, and not the algorithm speed.

How do we determine which algorithm is better when we're all running it on different computers?

Let's find out!

4.2.1. The Bubble Sort algorithm

Bubblesort Algorithm

The Bubble Sort algorithm works as follows:

  1. Begin with a list of unsorted items.
  2. Start with the first item in the list, and compare it with the one right after it. If the second item is smaller than the first, swap them. (At this point, the higher of the two items has been "bubbled" up in the list.)
  3. Continue on to the second item, and compare it with the third, swapping them as required like before. Continue through the entire list in this fashion. By the time you've passed through the list this first time, the highest value will have bubbled up to the end of the list.
  4. Repeat this process again, passing through the list and swapping items as necessary.
  5. Continue passing through the list in this fashion until no more swaps are performed. At this time the list is sorted.

Let's write a BubbleSort algorithm in Python.

BubbleSort

Write a program that:

  1. creates a list called numList of 20 items, each one a random integer between 1 and 1000, inclusive.
  2. applies a Bubble Sort to this list. Display the list onscreen as your program works, and included a pause statement that requires you to hit the <Enter> key to advance the program each step.
  3. prints out the final sorted list.

Show/hide Bubblesort

#!/usr/bin/env python3 import random import os import time def main(): numOfNumbers = 10000 # fill up the list of numbers numList = [] for i in range(numOfNumbers): numList.append(random.randrange(1000000) + 1) print(numList) # sort the list for upperLimit in range(len(numList)-1, 0, -1): sorted = True for i in range(upperLimit): # pass through list to upperLimit # compare a number with the one right after it if numList[i] > numList[i+1]: numList[i],numList[i+1] = numList[i+1], numList[i] sorted = False if sorted: break # If I'm sorted, no need to continue looking through list # print the sorted list print(numList) if __name__ == "__main__": main()

Once the basic program is working, apply modifications to the program as you did above to determine how quickly your program can perform its sorting.

4.2.2. Big-O notation

Do you have any sense of which of these programs is better, and under what conditions? Which one is faster? Which one is more efficient?

One aspect of computer science is determining the relationship between the amount of data we feed an algorithm and how quickly that algorithm runs. Although computers are able to perform individual instructions very quickly, programs that require the execution of thousands, millions, and billions of instructions are obviously going to take longer, and execution time becomes a concern.

Which sorting algorithm above performed a sort of 20 numbers more quickly? Which algorithm performed a sort of 10,000 numbers more quickly?

Does the performance of these algorithms vary under certain conditions? What if the list to be sorted is already partly sorted? What if it's completely reversed? Does this affect the time it takes for the algorithm to work?

Big-O notation

To analyze the performance of a specific type of algorithm, computer scientists use "Big-O notation," which describes the number of instructions that must be performed as a function of the size of the data set.

Example: "Selection Sort is an O(n2) 'Oh-n-squared' algorithm." If the number of values to be sorted (n) triples, then the time to perform the sort will increase by a factor of n-squared, or nine.

There are three ways that we can determine the O-value of an algorithm.

  1. We can construct a mathematical proof of its operation parameters.
  2. We can do a quick-and-dirty analysis on a simplified version of the algorithm.
  3. We can conduct experiments that actually collect time measurements of the algorithm's operation.

Because we've already built a timer into our programs above, let's look at some data collected for a Selection Sort algorithm working to sort a random list of words.

All times were collected using a Selection Sort running in Python 2.7.2 on an Apple Macbook Pro with a 2.66GHz Intel Core i7 dual core processor.

list sizetime (s)
1000.00186
5000.020346
10000.069943
20000.26478
40000.577992
50001.63441
100006.392076
1500015.56388

And here's a graph of those list sizes and times:

Notice that the data points line up almost perfectly along a quadratic curve, clearly indicating that Selection Sort is an O(n2) algorithm.

Bubble Sort efficiency

Collect data on the Bubble Sort algorithm by running it with data sets of varying size. Then do an analysis of your data points.

Is Bubble Sort an O(n2) algorithm also? an O(n) algorithm? an O(n log n) algorithm?

4.3. Linear Search

4.3.0. Overview

A linear search is just what it sounds like: given a list of values (sorted or unsorted), go through the list one item at a time, starting at the beginning, and checking each value in turn until you either:

  1. Find the item on the list, in which case you indicate its location, or
  2. You go through the entire list and don't find the number, in which case you indicate that the search term has not been found on the list.

Why would we want to do this? It turns out that knowing the location of an item in a list is an important element in more complex algorithms and problems.

A linear search is not usually the most efficient means of finding an item in a list, but it's a good start for us at this point.

Let's see how to write one.

linear_search.py

Write a program that uses a function to create a list of random numbers. Include:

4.4. Binary Search

4.4.0. Overview

Sorting is one classic context for exploring different types of algorithms. You might be interested to know that Python actually provides a .sort() method that works on lists. It's built in to the language, and uses a Quicksort algorithm to sort any list you give it.

Pretty cool!

The next context for examining algorithms is the process of searching. Given a list of items, how do you go about finding an item on the list?

Imagine the various collections of things you have in your life: your music (if you have MP3s or CDs), your photos on your cellphone, your files on your computer, your notes for class... How do you look through those collections of items to find stuff? Do you keep those things ordered so that you can easily find one item in the collection, or do you have them piled up in various folders/piles?

Computers need to be able to find stuff. Let's see how we can search for things.

4.4.1. Searching

Okay, so we've got our lists of items sorted. How do you quickly go about finding something on the list? Just flipping through the list sequentially, starting with the first item and continuing until you get to the end doesn't seem very efficient, especially if you've got a very long list. This is a legitimate search strategy—it's called a Linear Search—but it's probably not the one we'd want to use here.

Binary Search algorithm

The Binary Search algorithm works as follows:

  1. Begin with a list of sorted items and a "search item" that you're looking for.
  2. Compare the search item with the item at the middle of the sorted list.
    • If the items match, you've found the location of that item!
    • If the search item is greater than the middle item of the list, the bottom half of the sorted list can be eliminated from consideration. The top half of the list will become our new search space.
    • If the search item is less than the middle of the list, the upper half can be eliminated, and the bottom half of the list will become our new search space.
  3. Continue searching through the smaller and smaller sections of lists until either the search item is found, or it is not found.

The Binary Search can be tricky to implement. Challenges include identifying the middle of a list of an even number of items, and keeping track of upper- and lower-bounds of the search space.

There are also more sophisticated concerns, like "what happens if you have a billion items to search through?" Large lists like this can tax an algorithm, and even the experts can get tripped up.

We'll keep things simple in here.

4.4.2 Implementing BinarySearch

Once you think you know how a binary search works, take a few moments to see if you can write and debug a Python version of it.

BinarySearch

Write a program that:

  1. begins with a sorted list l = [1, 2, 3, 4, 7, 9, 13, 14, 20]
  2. asks for a search item (a number between 1 and 20).
  3. performs a Binary Search for that item in your list, displaying the results as it goes.
  4. Print out the location of the search item, if it exists. Otherwise, print out a "search item not found" message.

Test your program with the following lists to see if it works. When testing your program, be sure to look for an item that is on the list, and also an item that isn't on the list.

Search patterns to try
List to searchItem to look forExpected result
[1, 2, 3, 4, 7, 9, 13, 14, 20]7Position 4
[1, 2, 3, 4, 7, 9, 13, 14, 20]1Position 0
[1, 2, 3, 4, 7, 9, 13, 14, 20]20Position 8
[1, 2, 3, 4, 7, 9, 13, 14, 20]-2Search item not found
[1, 2, 3, 4, 7, 9, 13, 14, 20]23Search item not found
[1, 2, 3, 4, 7, 9, 13, 14, 20]10Search item not found
[4, 7, 9, 13, 14, 20]9Position 2
[4, 7, 9, 13, 14, 20]13Position 3
[4, 7, 9, 13, 14, 20]20Position 5
[4, 7, 9, 13, 14, 20]2Search item not found
[4, 7, 9, 13, 14, 20]10Search item not found
[4, 7, 9, 13, 14, 20]22Search item not found

4.5. Nested Loops

4.5.0. Overview

Sorting and Searching algorithms really give you a chance to get some practice with one-dimensional lists. The next logical step is to look at a "list of lists"—a two-dimensional array.

4.5.1. Basic nested loops

You may recall that a basic nested loops consists of a loop inside a loop. One example would be using variables for ones and tens to count from 0 0 to 9 9:

for tens in range(10): for ones in range(10): print(tens, ones) print("Done!")

Another example is the Number Guessing Game. We had an inner loop that allowed the user to repeatedly guess a secret number, and an outer loop that asked them if they wanted to play again.

# Pseudocode user_wants_to_play = "y" while (user_wants_to_play == "y": # outer loop secret_num = random value guess_count = 1 # inner loop while (user_guess != secret_num and guess_count < 4) # get user_guess # check guess # give hint # game over user_wants_to_play = input("Wanna play again? ") print("Bye!")

4.5.2. Lists of lists

We've been able to keep track of a list of data, what you might call a row of data:

[ 0, 2, 4, 6, 8, 10 ]

A list can keep track of any kind of information: numbers, strings, ... and even lists. This "row" of data could hold three lists like this:

lst = [ [1,2,3,4], [5,6,7,8], [9,10,11,12] ] index = 0 1 2 lst[0] = [1,2,3,4] lst[1] = [5,6,7,8] lst[2] = [9,10,11,12]

Let's make sure we know how we can refer to these values. If lst[1] is [5,6,7,8], what is lst[1][2] ?

It's the item index 2, in [5,6,7,8], the number 7.

In fact, we can refer to any element in the list-of-lists by referring to the index of the main list, and then the index of the element in the "sub-list."

4.5.3. Re-thinking Lists of Lists

The list of lists we've been looking at is this one:

lst = [ [1,2,3,4], [5,6,7,8], [9,10,11,12] ]

This is fine, but you can also write the same list like this:

[ [1,2,3,4], [5,6,7,8], [9,10,11,12] ]

When we arrange the lists like this, it's easy to think of the big list of 3 items as representing three rows, and the sub-lists in each row as representing four columns of data.

And that's exactly how we're going to use them.

cols 0 1 2 3 +------------------- 0 | 1 2 3 4 rows 1 | 5 6 7 8 2 | 9 10 11 12

4.5.4. Setting up a list-of-lists in Python

Here's one way to fill up a list of lists, assuming we want it to be height rows high and width cols across:

lst = [] height = 3 width = 4 for row in range(height): lst.append([]) # create an empty row for col in range(width): lst[row].append(0) # create a column with 0 in it

4.5.5. Printing out a list-of-lists in Python

Once you've got a list of lists all set up, you can print it all out using something like this:

for row in range(height): for col in range(width): print(lst[row][col], end=' ') print()

4.6. Debugging

From your very first program you know how easy it is to make mistakes when trying to get the computer to do something for you.

You may already have developed some sense of what kind of strategy you can use to figure out what's not working. Nevertheless, given that we spend so much of our time debugging, double-check that you're being as efficient as possible when fixing your code.

Here's how.

4.6.1. Debugging Strategies

Our programs have started to get pretty complex... and they're only going to get more complex from here on out. It's useful to have a few tricks that you can use when writing and debugging your program. You'll quickly learn the advantages of each of these as our problems become more difficult.

Coding and Debugging Strategies

Break Point

A break point is a stopping place in a program that is used during debugging to understand the status of the program at that point.

Integrated Development Environments (IDEs) provide a programmer with the capability to create breakpoints automatically, but because we're coding "by hand," you'll need to write your own breakpoints. Here's how you do that:

  1. Insert a print statement to indicate the status of variables that you're concerned about:
    print("DEBUG: a =",a,"and b=",b) print(locals()) # an easier, fancier way of doing this in Python
  2. Insert an input statement so that the program stops to wait for user input:
    print("DEBUG: a =",a,"and b=",b) print(locals()) # an easier, fancier way of doing this in Python pause = input()

We don't actually care about any input here, but this statement has the effect of pausing the program for a moment so that we can read the status of the variables, and hopefully figure out what might be causing the program to misbehave.

Once the problem has been solved, you can comment out or remove these statements entirely so that only the clean code is in your program.

4.6.2. Example of Debugging: Binary Search

If I'm writing the Binary Search program described above, I might think I have a pretty good idea of how my program is working, and yet... I'm not getting the right results. You don't need to know about the Binary Search algorithm to understand the point here. Just look at my (not working yet) code initially:

lower = 0 # position of lowest value in search space upper = len(sortedList) - 1 # position of highest value in search space middle = ...? # position of middle value in search space searching = True # flag to let us know we're still looking while (searching): if upper < lower: found = False searching = False if sortedList[middle] == searchItem: found = True searching = False elif sortedList[middle] > searchItem: upper = middle - 1 middle = (upper + lower) // 2 else: lower = middle + 1 middle = (upper + lower) // 2

This version of the program is producing output but the results are incorrect. How can I go about diagnosing what's going wrong with my program?

Here's a new version of the program, with additional print statements and a pause built in so I can see what's going on. The breakpoint code is highlighted in bold.

# NOTE: There is an error in this code! lower = 0 upper = len(sortedList) - 1 middle = (lower + upper) // 2 searching = True while (searching): print("Locations for L, M, and U are:",lower,middle,upper) print("We're searching between",sortedList[lower],"and",sortedList[upper]) if upper == lower: print("Your item is not on our list! :(") found = False searching = False if sortedList[middle] == searchItem: print("We found it at location", middle, "!") found = True searching = False elif sortedList[middle] > searchItem: print(searchItem,"is less than",sortedList[middle]) upper = middle - 1 middle = (upper + lower) / 2 else: print(searchItem,"is greater than",sortedList[middle]) lower = middle + 1 middle = (upper + lower) / 2 pause = input("")

All those extra print statements slow down the program, and having to hit the [Enter] key every time the loop is definitely not a feature that I'll want in my final version. But in the meantime, I can use the breakpoint—the print statements along with the pause—to let me examine the inner workings of the program and figure out what's going on. (It turns out that I should have said if upper < lower, not if upper == lower.) And once I've got it working, I'll either remove those statements completely, or just comment them out in case I think I might need to use them again later on when I add a new feature to the program.

4.7. Review of the First Quarter

This just about brings us to the end of some of our major work in learning how to program. At this point you know the major strategies and structures for writing instructions in Python.

Here's a quick summary of what we've covered already in just the first few weeks of study.

Some major elements of programming in Python

There is a lot more to programming in Python than this, but we've made a very good start, and we're already in a position to write some complex and powerful programs.