WORKSHOP - REGULAR EXPRESSIONS for MERE MORTALS

Definition: Regular Expression

A regular expression is a coded string of characters that allows you to find and replace patterns in a text document.

If you have text—a list of items, say, or some other text document—regular expressions can be used to interact with that text.

1. Install a Text Editor

We're working with bare text--no fancy formatting, because things are complicated enough. You can figure out how you want to format things later.

Working with text, we need to use a "text editor."

Download and install VS Code for your Mac or PC.

Now we can create text-based documents quickly and easily. Fantastic for taking notes at meetings, etc.

2. Configure VS Code

Text editors are so enormously powerful that there are many, many options that can be configured. We need to change only a couple of the default options, however:

  1. Turn word wrap on: Code > Preferences > Settings (File > Preferences > Settings on a PC). Search for "Word Wrap", and set it to "on".
  2. Install an extension that will allow us to sort lines: Code > Preferences > Extensions. Search for "Sort Lines", and install the package by Daniel Imms.

3. Using a text editor

A text editor works just like Word or Google Docs: you can type or paste text into it, and you can copy stuff in it or save it in files.

Interesting point: This particular text editor doesn't allow you to print, but that's okay—we're just using it as a tool for doing other things. If you really need to print something, you can copy-paste it into Notepad (PC), TextEdit (Mac), or Word or Google Docs and print it from there.

So that we have something to work with, select and copy the text below, then paste it into a new document in VS Code. Or, download the directory.txt file. (I copied this from Poly's Administrative webpage.)

Photo of John Bracker
John Bracker
Dept: Administration, Head of School, Staff
Title: Head of School
Business Phone: 626-396-6301
Photo of Steve Beerman
Steve Beerman
Dept: Administration, Athletics, Middle School Physical Education, Lower School Physical Education
Title: Director of Athletics
Business Phone: 626-396-6461
Read Bio
Photo of Leslie Carmell
Leslie Carmell
Dept: Administration, Communications, Staff
Title: Director of Communications
Business Phone: 626-396-6345
Read Bio
Photo of Sean Dwyer
Sean Dwyer
Dept: Administration, Business, Staff
Title: Director of Human Resources and Risk Management
Business Phone: 626-396-6352
Read Bio
Photo of Jonathan Fay
Jon Fay
Dept: Administration, Mathematics, Summer School
Title: Director of Summer and Extended Day Programs; Upper School Mathematics
Business Phone: 626-396-6310
Read Bio
Photo of Gregory Feldmeth
Greg Feldmeth
Dept: Administration, History
Title: Assistant Head of School; Co-chair, History Department; Upper School History; Global Online Academy
Business Phone: 626-396-6610
Read Bio
Photo of Jennifer Fleischer
Jennifer Fleischer
Dept: Administration, Staff
Title: Upper School Director
Business Phone: 626-396-6601
Photo of Patrick Gray
Pat Gray
Dept: Administration, Middle School - All, Staff
Title: Middle School Director; Coach
Business Phone: 626-396-6501
Photo of Keith Huyssoon
Keith Huyssoon
Dept: Administration, Business, Staff
Title: Chief Financial Officer
Business Phone: 626-396-6351
Read Bio
Photo of Paula Martin
Paula Martin
Dept: Lower School - All, Administration
Title: Lower School Director
Business Phone: 626-396-6401
Read Bio

4. Regular Expressions

Now things get fun!

"Regular expressions" are ways of describing patterns of text using symbolic identifiers.

Setting up regular expression mode

Open the Find box (command-f on a Mac, ctrl-f on a PC), then turn on "regular expression mode" by clicking the .* button.

4.a. Simple Find

You can still perform simple find commands.

Simple Find

  1. Activate the Find feature by pressing command-f (Mac) or ctrl-f (PC).
  2. Type John into the search field of the Find dropdown to see occurrences of that text highlighted.

4.b. Find using regular expressions

The special character ^ is used to find the beginning of a line.

Find only occurrences of "John" at the beginning of a line

Type ^John into the search field of the Find dropdown to see occurrences of that text highlighted.

Another special character is the $, which indicates the end of a line.

"er" in the document

Find all occurrences of er in the document.

Then find only those that occur at the end of a line. (Note that the end of the line $ occurs after the er, so you'll be searching for er$.)
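If you'd like to try the same two anchors outside the editor, here is a minimal sketch using Python's built-in re module. Python is not part of this workshop, and the three-line sample text is just a small piece of the directory chosen for illustration.

```python
import re

# A small sample taken from the directory text (three lines).
text = "John Bracker\nPhoto of John Bracker\nSteve Beerman"

# re.MULTILINE makes ^ and $ match at the start and end of every line,
# which is how the editor's Find box treats them.
starts_with_john = re.findall(r"^John", text, flags=re.MULTILINE)
ends_with_er = re.findall(r"er$", text, flags=re.MULTILINE)
all_er = re.findall(r"er", text)

print(len(starts_with_john))  # 1: only the first line begins with "John"
print(len(ends_with_er))      # 2: the two lines ending in "Bracker"
print(len(all_er))            # 3: the "er" inside "Beerman" counts too
```

Without the anchors, "John" and "er" are found wherever they appear; with them, only the occurrences at the edge of a line are counted.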

5. Common Regular Expression symbols

By composing regular expression patterns, we can search for increasingly complex collections of information.

Some Regular Expressions Rules

  1. Most characters match themselves.
  2. ^ and $ refer to the beginning and end of a line (as discussed).
  3. . matches any single character (other than a line break). If you actually want to match a ".", use \.
  4. * means 0 or more of whatever comes just before it
  5. + means 1 or more of whatever comes just before it
  6. Sets of characters can be enclosed in [].
    • [xyz] matches any of the characters x, y, or z
    • [a-z] matches any of the characters from a to z
    • [^xyz] matches anything but the characters x, y, or z
  7. Non-printing characters include \n, \t (newline and tab)
  8. Special characters include:
    • \s (whitespace characters like spaces and tabs)
    • \w (word characters: letters, digits, and the underscore)
    • \d (digits)
  9. Curly braces indicate how many of a pattern are being matched.
    • \s{2} matches exactly 2 whitespace characters
    • \s{2,} matches 2 or more whitespace characters
    • \s{2,4} matches 2-4 whitespace characters
  10. Parentheses ( ) can be used to form a series of logical groups, which can then be referred to as \1, \2, \3 in replace patterns. (Note that VS Code uses $1, $2, etc.)

A summary of these rules along with some examples is available on this PDF.
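A few of these rules can be tried out in a short sketch with Python's re module, which shares the symbols listed above. The sample sentence is made up for illustration and is not part of the workshop files.

```python
import re

# Made-up sample: note the two spaces after the comma and the tab before "or".
sample = "Room 12,  call 626-396-6301\tor ask A.B."

# \d+ : runs of one or more digits
digit_runs = re.findall(r"\d+", sample)

# \s{2,} : two or more whitespace characters in a row
wide_gaps = re.findall(r"\s{2,}", sample)

# [A-Z]\. : a capital letter followed by a literal period
initials = re.findall(r"[A-Z]\.", sample)

# [A-Z]. : without the backslash, the dot matches any character
capital_plus_any = re.findall(r"[A-Z].", sample)

print(digit_runs)        # ['12', '626', '396', '6301']
print(wide_gaps)         # ['  ']
print(initials)          # ['A.', 'B.']
print(capital_plus_any)  # ['Ro', 'A.', 'B.']
```

The last two results differ only because of the backslash: the escaped \. insists on a real period, while the bare . is happy with the "o" in "Room".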

These patterns all match the given text

Can you explain why?

    \w\w\w → "ABC"
    \w+ → "ABC"
    \w* → "ABC"
    .* → "ABC"
    ..C → "ABC"
    .+@.+ → "rwhite@crashwhite.com"
    .+@.+\..+ → "rwhite@crashwhite.com"

These patterns don't match the whole of the given text

Can you explain why?

    \w\w\w → "AB"
    \w+ → "Richard White"
    \d\w\d → "2+2"
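Once you have worked out your own explanations, one possible way to check them is Python's re.fullmatch, which succeeds only when the pattern accounts for the entire text. This is a sketch for self-checking, not part of the workshop; the pairs are taken from the lists above.

```python
import re

# (pattern, text) pairs taken from the two lists above.
pairs = [
    (r"\w\w\w", "ABC"),
    (r".+@.+\..+", "rwhite@crashwhite.com"),
    (r"\w\w\w", "AB"),
    (r"\w+", "Richard White"),
    (r"\d\w\d", "2+2"),
]

# fullmatch returns a match object only if the whole text fits the pattern.
whole_text_matches = [re.fullmatch(p, t) is not None for p, t in pairs]
print(whole_text_matches)  # [True, True, False, False, False]

# search is closer to the editor's Find: it reports the first part that fits.
first_hit = re.search(r"\w+", "Richard White").group()
print(first_hit)  # Richard
```

The last line shows why \w+ fails on "Richard White" as a whole: the pattern matches each word on its own, but \w cannot cross the space between them.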

What about these?

  1. Will the pattern
        \w+, ..

    find the line
        Pasadena, CA
    ?
  2. The pattern
        [A-Z][A-Z]

    matches which part of the line
        I work in Pasadena, CA 
    ?

Just the telephone numbers

There are a lot of ways that you could find just the telephone numbers...
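As one possibility among many, here is a sketch in Python's re module that pulls out anything shaped like three digits, three digits, four digits. The sample is a few lines from the directory text; the pattern is an assumption about what a phone number looks like in this file, not the only reasonable answer.

```python
import re

# A few lines from the directory text.
text = (
    "Title: Head of School\n"
    "Business Phone: 626-396-6301\n"
    "Title: Director of Athletics\n"
    "Business Phone: 626-396-6461\n"
)

# \d{3}-\d{3}-\d{4} : 3 digits, dash, 3 digits, dash, 4 digits
phones = re.findall(r"\d{3}-\d{3}-\d{4}", text)
print(phones)  # ['626-396-6301', '626-396-6461']
```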

6. Grouping and Replacing

In some cases it will be convenient to group part of a pattern into a "subpattern" that you can refer to later.

6.a. Subpatterns can be created using parentheses.

Telephone numbers could be identified by this pattern: (\d+)-(\d+)-(\d+)

Super Advanced: Social Security numbers would also match that pattern, however. If you really want just 3 digits, 3 digits, 4 digits, you'd do it like this: (\d{3})-(\d{3})-(\d{4})
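The difference between the two patterns can be seen in a short Python sketch. The Social Security-style number below is made up for illustration.

```python
import re

loose = r"(\d+)-(\d+)-(\d+)"          # any three runs of digits
strict = r"(\d{3})-(\d{3})-(\d{4})"   # exactly 3, 3, and 4 digits

phone = "626-396-6301"
ssn_like = "123-45-6789"  # made-up number in 3-2-4 form

loose_accepts_ssn = re.fullmatch(loose, ssn_like) is not None
strict_accepts_ssn = re.fullmatch(strict, ssn_like) is not None
print(loose_accepts_ssn)   # True
print(strict_accepts_ssn)  # False

# Each pair of parentheses captures one subpattern.
parts = re.fullmatch(strict, phone).groups()
print(parts)  # ('626', '396', '6301')
```

The three captured pieces correspond to the first, second, and third subpatterns, in the order their opening parentheses appear.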

6.b. Subpatterns can be replaced into a result.

The first subpattern in your regular expression can be used in the Replace field by entering $1, the second subpattern is $2, and so on. This can be enormously useful.

Search/Replace

What will this regular expression search-replace pattern produce when run on the following list?

    Search: (\w+), (\w+)
    Replace: $2 $1

    Fleischer, Jennifer
    Livingstone, Ryder
    Gladden, JD
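To experiment with this kind of search-replace outside the editor, here is a minimal Python sketch. Note that Python's re module writes the groups as \1 and \2 in the replacement rather than VS Code's $1 and $2. The name used is a different one from the list above, so the exercise is left for you.

```python
import re

# Same search pattern as the exercise; the replacement uses \2 and \1
# because Python does not use the $1, $2 notation.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "White, Richard")
print(swapped)  # Richard White
```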

7. Activities

Activity 1. Creating a phone directory

Although the text above has a nice collection of information, I want to create just a phone list from that data, alphabetized by person.

Strategy:

  1. Try to clean up the data as best you can before getting started. It looks like there are some random tabs in here, and some spaces at the beginning of some lines, or some extra lines. Find those abnormalities and remove them. (You can remove a matched pattern by just replacing it with nothing.)
  2. Using regular expression find-replace, remove all the lines that begin with the word "Photo" and end with a new line character "\n".
  3. Do the same with lines beginning with "Dept".
  4. Do the same for lines beginning with "Title".
  5. Remove all the "Read Bio" lines.
  6. Now, let's put the names and numbers on the same line, separated by a dash:
    Find: \nBusiness Phone:
    Replace: -
  7. Let's reverse the order of the two names so that we can alphabetize them:
    Find: (\w+) (\w+)
    Replace: $2, $1
  8. And finally, alphabetize the list: View > Command Palette and type sort to find the "sort lines" commands.
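For comparison, the whole strategy can be sketched as a short Python script. This is only one way to do it, under a few assumptions: the sample is two entries from the directory text with one field per line and no stray tabs or spaces, Python writes groups as \1 and \2 rather than $1 and $2, and the spacing around the dash is a choice made here for readability.

```python
import re

# Two entries from the directory text, one field per line.
raw = """Photo of John Bracker
John Bracker
Dept: Administration, Head of School, Staff
Title: Head of School
Business Phone: 626-396-6301
Photo of Steve Beerman
Steve Beerman
Dept: Administration, Athletics, Middle School Physical Education, Lower School Physical Education
Title: Director of Athletics
Business Phone: 626-396-6461
Read Bio
"""

M = re.MULTILINE  # lets ^ match at the start of every line

# Remove whole lines, including the \n that ends each one.
text = re.sub(r"^Photo.*\n", "", raw, flags=M)
text = re.sub(r"^Dept.*\n", "", text, flags=M)
text = re.sub(r"^Title.*\n", "", text, flags=M)
text = re.sub(r"^Read Bio\n", "", text, flags=M)

# Join each name with its number (the spaces around the dash are an assumption).
text = re.sub(r"\nBusiness Phone: ", " - ", text)

# Reverse the two names at the start of each line: "John Bracker" -> "Bracker, John".
text = re.sub(r"^(\w+) (\w+)", r"\2, \1", text, flags=M)

# Alphabetize the lines.
phone_list = sorted(text.splitlines())
for line in phone_list:
    print(line)
# Beerman, Steve - 626-396-6461
# Bracker, John - 626-396-6301
```

The name-swapping pattern is anchored with ^ here so that it can only touch the two words at the start of each line.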

Activity 2. Stealing email addresses

Although the text above was copied from the website, the email addresses listed on that page weren't copied due to some fancy anti-spam measures built into the webpage. We want to get a list of the email addresses for those administrators anyway so that we can spam them.

Strategy:

  1. Try to clean up the data as best you can before getting started. It looks like there are some random tabs in here, and some spaces at the beginning of some lines, or some extra lines. Find those abnormalities and remove them. (You can remove a matched pattern by just replacing it with nothing.)
  2. Using regular expression find-replace, remove all the lines that begin with the word "Photo" and end with a new line character "\n".
  3. Do the same with lines beginning with "Dept".
  4. Do the same for lines beginning with "Business".
  5. Do the same for lines beginning with "Title".
  6. Remove all the "Read Bio" lines.
  7. Now, let's convert those names to email addresses:
    Find: (\w)(\w+) (\w+)
    Replace: $1$3@polytechnic.org
  8. Finally, put all those emails together into one line, separated by commas:
    Find: \n
    Replace: ,
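The last two steps can be sketched in Python as well, assuming the earlier steps have already reduced the text to one name per line. The three names come from the directory text; Python writes the groups as \1 and \3 rather than $1 and $3, and the pattern only builds strings in this format without any way of knowing whether such addresses exist.

```python
import re

# Assumes the cleanup steps have left one name per line.
names = "John Bracker\nSteve Beerman\nLeslie Carmell"

# Group 1 is the first initial, group 2 the rest of the first name (unused),
# group 3 the last name.
emails = re.sub(r"(\w)(\w+) (\w+)", r"\1\3@polytechnic.org", names)

# Join the lines into one, separated by commas.
one_line = re.sub(r"\n", ",", emails)
print(one_line)
# JBracker@polytechnic.org,SBeerman@polytechnic.org,LCarmell@polytechnic.org
```

Splitting the first name into (\w)(\w+) is what makes it possible to keep the initial in group 1 while leaving the remaining letters out of the replacement.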