Updated: 2015-02-12 02:20 EST

The 100 problems of using regular expressions

1 Due Date and Deliverables

Do not print this assignment on paper!

WARNING: Some inattentive students upload Assignment #03 into the Assignment #02 upload area. Don’t make that mistake! Be exact.

2 Purpose of this Assignment

Do not print this assignment on paper! On paper, you cannot follow any of the hyperlink URLs that lead you to hints and course notes relevant to answering a question.

  1. Practise with anchored extended regular expressions of varying complexity
  2. Create simple shell scripts
  3. Practise with a text editor

3 Introduction and Overview

This is an overview of how you are expected to complete this assignment. Read all the words before you start working.

For full marks, follow these directions exactly.

  1. Complete the Tasks listed below.
  2. Verify your own work before running the Checking Program.
  3. Run the Checking Program to help you find errors.
  4. Submit the output of the Checking Program to Blackboard before the due date.
  5. READ ALL THE WORDS to work effectively and not waste time.

You are given a file of somewhat random text, and a set of descriptions of sets of lines in that file. For each description, you are to produce a grep -E command with one single anchored extended regular expression that will select the described set of lines. You will initially test your regular expressions on the interactive shell command line, and when you are satisfied with each one, you will put the command you used into a shell script and have it read a test file.

You can use a Checking Program to check your work as you do the tasks. You can check your work with the checking program as often as you like before you submit your final mark. (Some task sections below require you to finish the whole section before running the checking program; you may not always be able to run the checking program successfully after every single task step.)

Since I also do manual marking of student assignments, your final mark may not be the same as the mark submitted using the current version of the Checking Program. I do not guarantee that any version of the Checking Program will find all the errors in your work. Complete your assignments according to the specifications, not according to the incomplete set of the mistakes detected by the Checking Program.

3.1 Save your work

You will create a file system structure in your HOME directory on the CLS, with various directories, files, and links. When you have finished the tasks, leave these files, directories, and links in place as part of your deliverables on the CLS. Do not delete any assignment work until after the term is over! Assignments may be re-marked at any time; you must have your term work available right until term end.

3.2 The Source Directory

All references to the Source Directory below are to the CLS directory ~idallen/cst8177/15w/assignment04/ and that name starts with a tilde character ~ followed by a user name with no intervening slash. The leading tilde indicates to the shell that the pathname starts with the HOME directory of the account idallen (seven letters).
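
As a small illustration of the tilde rule described above, the following sketch expands ~root instead of ~idallen. The root account is used only because it is assumed to exist on whatever system runs the example; on the CLS, ~idallen expands the same way.

```shell
# ~name (no slash between the tilde and the name) expands to that account's
# HOME directory.  The root account is an assumption made for this sketch.
root_home=~root
echo "$root_home"

# A quoted tilde is not expanded; the shell passes it through unchanged.
literal='~root'
echo "$literal"
```

The first line of output is an absolute pathname; the second is the literal text inside the quotes.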

You do not have permission to list the names of all the files in the Source Directory, but you can access any files whose names you already know.
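
The sketch below imitates this situation with a temporary stand-in directory. It assumes the Source Directory grants search (x) permission without read (r) permission; the actual permissions on the CLS are not stated here. Note that a privileged account such as root bypasses these permissions, so the listing step only fails for ordinary users.

```shell
# Hypothetical stand-in for a directory you may enter but not list.
src=$(mktemp -d)
echo "some text" > "$src/known.txt"
chmod u-r "$src"                 # remove read (list) permission, keep search (x)

# Listing the names needs read permission on the directory.
ls "$src" 2>/dev/null || echo "cannot list the directory"

# Opening a file whose name you already know needs only search permission.
known=$(cat "$src/known.txt")
echo "$known"

chmod u+r "$src"                 # restore permission so the directory can be removed
rm -r "$src"
```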

3.3 Searching the course notes on the CLS

The previous term’s course notes are available on the Internet here: CST8207 GNU/Linux Operating Systems I. All the notes files are also searchable on the CLS. You can recall how to read and search these files using the command line on the CLS under the heading Copies of the CST8207 course notes near the bottom of the page Course Linux Server Course Notes.

4 Tasks

For full marks, follow these task directions below exactly as written. READ ALL THE WORDS to work effectively and not waste your time.

  1. Complete the Tasks listed below, in order, from top to bottom.
  2. Do not skip steps.
  3. These tasks must be done in your account on the Course Linux Server.
  4. Verify your own work before running the Checking Program.
  5. Run the Checking Program to help you find errors and grade your work.
  6. Submit the grading output of the Checking Program to Blackboard before the due date.

On the due date, your instructor will also mark the work you do in your account on the CLS. Leave all your work on the CLS and do not modify it. Do not delete any assignment work from the CLS until after the course is over.

4.1 Set Up – The Base Directory on the CLS

  1. Do a Remote Login to the Course Linux Server (CLS) from any existing computer, using the host name appropriate for whether you are on-campus or off-campus. All work in this assignment must be done on the CLS.

  2. Make the CLS directory ~/CST8177-15W/Assignments/assignment04, in which you will create the files for the following tasks.

This CLS assignment04 directory is the Base Directory for most pathnames in this assignment. Store your files and answers in this Base Directory on the CLS.

  3. Create the check symbolic link needed to run the Checking Program, as described in the section Checking Program below.

  4. The input text file test_input.txt in the Source Directory contains many lines of text. Put a soft link to this input file in your Base Directory. Use the same name for the link.

Use the symbolic link to run the Checking Program to verify your work so far.
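
A minimal sketch of making a soft link that keeps the same name as its target follows. The two temporary directories are hypothetical stand-ins for the Source Directory and the Base Directory so that the example can run anywhere; on the CLS you would use the real pathnames given above.

```shell
# Hypothetical stand-ins for the Source Directory and the Base Directory.
src=$(mktemp -d)
base=$(mktemp -d)
echo "barbar" > "$src/test_input.txt"

cd "$base" || exit 1

# ln -s TARGET LINKNAME: the link name here is the same as the target's name.
ln -s "$src/test_input.txt" test_input.txt

ls -l test_input.txt                      # the listing shows where the link points
link_target=$(readlink test_input.txt)
first_line=$(head -n 1 test_input.txt)    # reading the link reads the target file
echo "$first_line"
```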

4.2 Write Extended Regular Expression Commands

Below, in the Labelled Descriptions section, you are given labelled descriptions of anchored extended regular expressions that match lines.

For each labelled description you will repeat the eight steps that are described in detail below.

Each set of lines to be found is labelled below with a label. The label is the first word in the section, followed by a colon. For example, the following example description is labelled bar:

bar: lines that consist of (only) the single word barbar (and nothing else)

Repeat the following eight steps for each of the labelled descriptions:

4.2.1 Repeat these eight steps for each label

  1. Make your current working directory the Base Directory (the directory containing the new symlink you made to the test_input.txt file) if it is not already so.

  2. Read carefully the label and description of the kind of lines that must be matched. You must write a single grep -E command using a single anchored extended regular expression pattern.

    Unlike the previous assignment, all your regular expressions in this assignment must be anchored so that they match exactly what the description asks for and nothing more. For example, if you’re looking for a phone number, your regular expression must look for lines that contain a single phone number, not multiple numbers, and nothing else may appear anywhere else on the line before or after the phone number.

    There are example lines given below for each labelled description that you must match, and lines that you must not match. Test your expression against these example lines before you test it against the large test file.

    Hint: Save the example lines in a file for repeated use.

    Type directly at the shell command line your initial attempt at a grep -E command that finds the lines, and view the result on your screen. No pipes, multiple expressions, or other options are allowed. Use only a single grep -E command with a single anchored extended regular expression.

  3. If you’re not satisfied with the output you see, use up-arrow to retrieve the previous command, and make changes to the extended regular expression, then re-run the new command. Repeat this step over and over on the interactive command line until you’re satisfied with the output on your screen and want to check your answer.

    When you are confident that your expression is correct, use your command to find lines in the large test_input.txt file and verify the word count and checksum.

    For the example given above with the label bar and word barbar, a grep -E command with an anchored extended regular expression that would work would be:

    $ grep -E '^(bar){2}$' test_input.txt

    The following would all be incorrect solutions, based on the above requirements:

    $ grep -E '(bar){2}$' test_input.txt           # WRONG
    $ grep -E '^(bar){2}' test_input.txt           # WRONG
    $ grep -E '(bar){2}' test_input.txt            # WRONG
    $ grep -E '^.*(bar){2}.*$' test_input.txt      # WRONG
    $ grep -E '^ *(bar){2} *$' test_input.txt      # WRONG
    $ grep -E '^Barbar$' test_input.txt            # WRONG
    $ grep -E -w '^(bar){2}$' test_input.txt       # WRONG
    $ grep -E -i '^(bar){2}$' test_input.txt       # WRONG

    The correct lines of output on your screen for each problem below will vary between a few and a few hundred lines, depending on the problem.

  4. To check your answer against the big test_input.txt file, use up-arrow to retrieve the command, and modify it to pipe the output of your command into the wc program, then do the same, changing wc to sum. Compare the output of wc and sum with the expected values output by the Checking Program.

    For the example given above with the label bar, the checking pipelines would be done like this, in this order:

    $ grep -E '^(bar){2}$' test_input.txt
    $ grep -E '^(bar){2}$' test_input.txt | wc
    $ grep -E '^(bar){2}$' test_input.txt | sum

    The '^(bar){2}$' string is the quoted, anchored, extended regular expression.

  5. If the word count or checksum values differ from those expected values output by the Checking Program, you need to fix your extended regular expression. Use up-arrow to retrieve the command, make your changes to the extended regular expression, and re-run the command until you get it right.

    Hint: Always check your expression against the given example lines first, before you test against the large input file.

    Do not save the output of the Checking Program; the test file may change at any time to include new test cases, so the word count and checksums may change at any time.

  6. When you are satisfied with your answer as typed on the command line, use a text editor to create in your Base Directory an executable shell script whose name is the label name followed by a .sh extension, e.g. bar.sh. Copy the working grep -E command from the command line into the last line of the new shell script. Only put the grep -E command into the script, not any pipelines or checking. This executable script must run only your grep -E command using test_input.txt as the file name.

    For the example given above with the label bar, the script name must be bar.sh in the Base Directory.

    The first three lines of every shell script must correspond exactly to the Script Header described in class.

    The last line of every script will be your grep -E command. Do not redirect or pipe the output of your command into anything inside the script – the script should produce the correct lines of output from test_input.txt on your screen so that it can be checked.

    Do not put any other command lines into your script other than the Script Header and the single grep -E command line.

  7. You can also check the output of your script using the wc and sum commands, similar to the way you checked the original grep command. The script must output exactly the same lines as the original grep command that you put into it. The results should be identical:

    $ grep -E '^(bar){2}$' test_input.txt | wc
    $ ./bar.sh                            | wc
    
    $ grep -E '^(bar){2}$' test_input.txt | sum
    $ ./bar.sh                            | sum

  8. Add two comment lines to the script file, just above the grep command line:

    1. A comment that gives a non-blank sample line that would be matched by the script. It starts with # Sample Match: and would look similar to this example comment:

      # Sample Match: barbar
    2. A comment that gives a non-blank sample line that would not be matched by the script. It starts with # Non-Match: and would look similar to this example comment:

      # Non-Match: barbarbar

    The two samples in your comments must not copy any examples found in this assignment file. Invent your own unique matching and non-matching examples; do not copy mine.

Repeat the above eight steps for each of the Labelled Descriptions below.
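
The eight steps can be sketched end to end for the example label bar, as shown below. This is only an illustration under stated assumptions: the work is done in a temporary directory with a tiny stand-in test_input.txt, the three header lines are a placeholder and not necessarily the Script Header described in class, and the two sample comments are left as placeholders for your own examples.

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# A tiny stand-in for test_input.txt, holding example lines for the bar label.
printf '%s\n' 'barbar' 'barbarbar' 'Barbar' ' barbar ' 'bar' > test_input.txt

# Create bar.sh.  The first three lines are a placeholder assumption;
# use the Script Header described in class in your real scripts.
cat > bar.sh <<'EOF'
#!/bin/sh -u
PATH=/bin:/usr/bin ; export PATH
umask 022

# Sample Match: <your own matching line>
# Non-Match: <your own non-matching line>
grep -E '^(bar){2}$' test_input.txt
EOF
chmod +x bar.sh

# The script must output exactly what the interactive command outputs.
direct=$(grep -E '^(bar){2}$' test_input.txt | sum)
script=$(./bar.sh | sum)
echo "direct: $direct"
echo "script: $script"
```

The two checksums printed at the end should be identical; if they differ, the command inside the script is not the same as the one you tested.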

NOTE: When it comes time to create your second and subsequent scripts, copy the previous script to the new label name rather than starting from scratch every time. Run the Checking Program to make sure you have copied the Script Header correctly.

Do not put any lines into your script other than the Script Header, the two mandatory comment lines, the single grep -E command line, and any additional optional blank or comment lines you might want.

Your scripts must give the correct output word count and checksum results when searching in this test_input.txt test file. If the output is incorrect, you will be told what the correct values should be in the error message. Do not save this message – the testing file may change at any time during the assignment and your scripts must still match the correct lines.

Write the anchored extended regular expressions to match the given pattern specifications, not to match the particular set of lines in the given test file(s). I may come up with other test cases even after the due date of the assignment; your script loses marks if it fails these tests because it doesn’t do what the specification says it must do. You may have to write your own test cases, to be sure you got it right.
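
One possible way to write your own test cases is a small throw-away test harness like the sketch below, shown for the example label bar. The function name matches and the chosen test lines are inventions for this illustration. The harness uses grep -E -q only to examine the exit status; the extra option belongs in the harness, never in the script you hand in.

```shell
pattern='^(bar){2}$'

# Succeeds if the single line given as $1 is selected by the pattern.
matches() {
    printf '%s\n' "$1" | grep -E -q "$pattern"
}

failures=0

# Lines that the description says must be selected.
for line in 'barbar'; do
    if ! matches "$line"; then
        echo "should match but did not: $line"
        failures=$((failures + 1))
    fi
done

# Lines that the description says must be rejected, including an empty line.
for line in 'bar' 'barbarbar' 'Barbar' ' barbar' 'xbarbar' ''; do
    if matches "$line"; then
        echo "should not match but did: $line"
        failures=$((failures + 1))
    fi
done

echo "failures: $failures"
```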

I’ve also set up the checking program to detect failure to protect special characters from shell GLOB expansion. If your expression works in your account but not when the checking script runs it, lack of quoting may be your problem. You may also see “Permission denied” errors if quoting is your problem. Fix your script to hide special characters from the shell.
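
The effect of missing quotes can be seen in the sketch below, which counts how many arguments the shell produces from a pattern. The temporary directory and the file names abc and abd are made up for the demonstration; any directory containing names that match the GLOB pattern behaves the same way.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
touch abc abd                # two file names that the GLOB pattern a* matches

set -- a*                    # unquoted: the shell replaces a* with abc abd
unquoted_args=$#

set -- 'a*'                  # quoted: the pattern is passed along intact
quoted_args=$#
quoted_value=$1

echo "unquoted: $unquoted_args arguments; quoted: $quoted_args argument ($quoted_value)"
```

With the unquoted form, a command such as grep would receive file names in place of the regular expression, which is why an expression can appear to work in one directory and fail in another.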

4.3 Labelled Descriptions

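All the points below have the same format: a label and description, followed by example lines that should match and example lines that should not match.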

Here are the names of the patterns (and scripts) you must create:

4.3.1 password

  1. password: Lines consisting of a single word password or passwd, with any of the first four letters optionally capitalized.

    These should match:

    password
    Password
    PASSword
    passwd
    pasSwd
    pAsSwd

    These should not match:

    Pass
    passwD
    passWord
    passw
    passd
    ppaswd
    PPassword
    Here is my password.
    Passwd is spelled badly.
    <empty line>

4.3.2 names

  1. names: Lines consisting of a single name of a person, first letters (only) capitalized, with capitalized optional middle name, separated by a single space. Any alphabetic string is acceptable as a name. Only the first letter of each name may be a capital letter.

    These should match:

    John Smith
    John Yeardly Smith
    A B
    A B C
    Aabc Cdef Z
    Abc Def Ghi
    Abc Def

    These should not match:

    john Smith
    John  Smith
    John YeardlySmith
    JOHN SMITH
    Ab Cd EE
    a B
    A b C
    A B c
    A b
    abc Def
    Abc Def ghi
    A  B C
    A B C D
    Abc Def G hi
    My name is John Smith today.
    <empty line>

4.3.3 zipcode

  1. zipcode: Lines consisting of a single numeric USA zip code of the form 99999 or 99999-9999, with zeros allowed everywhere.

    These should match:

    12345
    00000
    01234
    23456-0000
    99999-0001
    00000-0000

    These should not match:

    123456
    1233-4444
    000-00000
    2345-34568
    23456-34568
    My zipcode is 12345 at home.
    <empty line>
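
A description with an optional trailing part, such as the one above, is commonly handled with a parenthesized group followed by a question mark. The sketch below shows that technique on a made-up format that is not one of the assigned labels: a seven-digit phone number with an optional extension of one to four digits.

```shell
# Hypothetical format: 999-9999 optionally followed by " x" and 1 to 4 digits.
# The (...)? group makes the whole extension optional, all or nothing.
pattern='^[0-9]{3}-[0-9]{4}( x[0-9]{1,4})?$'

matched=$(printf '%s\n' \
    '555-1234' \
    '555-1234 x12' \
    '555-1234 x' \
    '555-1234 x12345' \
    '5555-1234' \
    'Call 555-1234' | grep -E "$pattern")

echo "$matched"
```

Only the first two input lines are selected; the anchors reject everything with missing or extra text.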

4.3.4 hour12

  1. hour12: Lines consisting of a single one or two digit integer between 1 and 12 (inclusive), for valid 12-hour times. Hint: create one regular expression that matches numbers between 1 and 9, and another regular expression that matches numbers between 10 and 12, and combine those with alternation. (Note: 0 is not a valid 12-hour time.)

    These should match:

    1
    01
    2
    02
    9
    09
    10
    11
    12

    These should not match:

    0
    00
    20
    13
    90
    001
    011
    120
    012
    Noon is 12 noon.
    <empty line>
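
The alternation technique in the hint can be sketched on a different range, here the integers 5 through 15 with no leading zeros, which is not one of the assigned labels. The sketch also shows why the alternatives need parentheses: without them, each anchor applies to only one side of the alternation.

```shell
lines=$(printf '%s\n' 4 5 9 10 15 16 05 59 110)

# Parentheses make both anchors apply to every alternative.
good=$(printf '%s\n' "$lines" | grep -E '^([5-9]|1[0-5])$')

# Without parentheses this means: "starts with 5-9" OR "ends with 10-15".
loose=$(printf '%s\n' "$lines" | grep -E '^[5-9]|1[0-5]$')

echo "anchored group selects:" $good
echo "missing parentheses selects:" $loose
```

The grouped pattern selects 5, 9, 10 and 15; the version without parentheses also lets 59 and 110 through.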

4.3.5 daynum

  1. daynum: Lines consisting of a single integer between 1 and 31, for a valid day of a month. Hint: create one regular expression that matches numbers between 1 and 9 with optional leading zero, and another that matches numbers between 10 and 29, and one that matches numbers between 30 and 31, then use alternation to combine those three.

    These should match:

    1
    01
    9
    09
    10
    20
    30
    31

    These should not match:

    0
    00
    001
    100
    32
    031
    310
    56
    33
    100
    Has 31 days.
    <empty line>

4.3.6 hour24

  1. hour24: Lines consisting of a single non-negative integer less than 24, for a valid hour in 24-hour times. Hint: See the hints for the previous questions.

    These should match:

    0
    1
    2
    3
    00
    01
    05
    09
    10
    15
    20
    23

    These should not match:

    000
    100
    023
    009
    012
    24
    25
    045
    The number 23 is good.
    <empty line>

4.3.7 minutes

  1. minutes: Lines consisting of a single two digit integer less than 60, for valid minutes or seconds in a time.

    These should match:

    00
    01
    09
    10
    11
    20
    30
    59

    These should not match:

    000
    010
    200
    60
    070
    0
    9
    90
    1
    The answer 42 is right.
    <empty line>

4.3.8 decimal

  1. decimal: Lines consisting of a single unsigned decimal or floating point number.

    These should match:

    000
    0000.0000
    8.45
    2.768
    0.320
    .320
    96

    These should not match:

    .
    45.
    1..2
    PI is nearly 3.1416 but not really.
    <empty line>

4.3.9 currency

  1. currency: Lines consisting of a single dollar amount, starting with the leading dollar sign and optional two-digit cents.

    These should match:

    $0
    $1
    $12
    $.12
    $0.12
    $0000.12
    $1234.56

    These should not match:

    1
    12
    .12
    0.12
    1234.56
    1.
    1.2
    1.23
    $
    $.
    $.1
    $1.2
    $1.234
    I earn $10 an hour.
    <empty line>

4.3.10 date

  1. date: Lines consisting of a single date with syntax YYYY-MM-DD where the year (YYYY) is exactly 4 digits, the month (MM) is between 1 and 12, two digits maximum, and the day (DD) is between 1 and 31, two digits maximum. The day does not have to be accurate for February, June, leap years, etc.; it only has to be a number between 1 and 31, two digits maximum. Hint: Combine and re-use your work and hints from earlier questions!

    These should match:

    0000-01-01
    2014-1-1
    2014-1-15
    2014-01-31
    2014-02-31
    0000-6-31
    2014-12-31

    These should not match:

    0000-00-00
    0000-00-01
    2000-13-01
    2000-12-00
    20000-12-01
    2000-012-12
    2000-12-012
    2000-12-120
    2014-01-32
    Today is 2015-01-23 all day.
    <empty line>

4.3.11 time24hr

  1. time24hr: Lines consisting of a single 24-hour time with optional seconds with syntax HH:MM[:SS] where minutes and seconds must have exactly two digits.

    These should match:

    02:23
    2:23
    2:23:59
    12:23:59
    23:23
    00:00:00
    00:00
    00:00:59
    01:01:01

    These should not match:

    24:00
    12:60:00
    12:34:56:00
    12:15:60
    012:14:00
    11:59:001
    11:059
    11:059:1
    10:1:10
    The time is 02:23 in the morning.
    <empty line>

4.3.12 time12hr

  1. time12hr: Lines consisting of a single 12-hour based time with optional seconds and AM/PM using syntax HH:MM[:SS][am|AM|pm|PM] where minutes and seconds must have two digits, followed by an optional am, pm, AM, or PM. Hint: Re-use parts of your hour12 and minutes regular expressions from above in your answer. (Note: 00:00am is not a valid 12-hour time.)

    These should match:

    2:24pm
    2:24
    2:24PM
    2:24AM
    2:24am
    02:34
    12:59
    12:56:59
    4:56:56

    These should not match:

    0:00
    00:00
    00:01
    00:00am
    00:59:59PM
    99:99
    002:23
    13:01am
    23:01PM
    2:3pm
    1:2:3pm
    2:24pmpm
    2:24amPM
    2:23Pm
    2:23pM
    2:23aM
    2:23Am
    23:23
    12:23:76
    12:60:34
    1:1
    1:1:1
    10:1:10
    The time is 2:24pm and I nap.
    <empty line>

4.3.13 ipaddr

  1. ipaddr: Lines consisting of a single IPv4 address of four integers from 0 to 255 separated by dots. Each integer should be three digits or less, and leading zeros are OK.

    Hint: break each of the four integers into an alternation between the following ranges: integers greater or equal to 200 and less than or equal to 255 (which could be done as one group 200 through 249 and a second group 250 through 255), integers from 100 to 199, integers from 10 to 99, integers from 0 to 9.

    These should match:

    255.255.255.255
    1.1.1.1
    02.089.89.001
    0.0.0.0
    00.01.002.000
    23.234.123.123
    12.12.12.12
    012.012.012.012

    These should not match:

    1.1.1.
    1.
    1.1
    1.1.1.1.1
    0234.1.1.1
    234.0234.166.23
    1.1.1
    345.2.2.2
    299.2.2.2
    Broadcast using 255.255.255.255 as your IP.
    <empty line>

Check your work so far using the checking program symlink.

Do not save the wc and sum output of the Checking Program; the test file may change at any time to include new test cases, so the word count and checksums may change at any time.

4.4 When you are done

That is all the tasks you need to do.

Read your CLS Linux EMail and remove any messages that may be waiting. See Reading eMail for help.

Check your work a final time using the Checking Program and save the standard output as described below. Submit your mark following the directions below.

5 Checking, Marking, and Submitting your Work

Summary: Do some tasks, then run the checking program to verify your work as you go. You can run the checking program as often as you want. When you have the best mark, upload the marks file to Blackboard.

Since I also do manual marking of student assignments, your final mark may not be the same as the mark submitted using the current version of the Checking Program. I do not guarantee that any version of the Checking Program will find all the errors in your work. Complete your assignments according to the specifications, not according to the incomplete set of the mistakes detected by the Checking Program.

  1. There is a Checking Program named assignment04check in the Source Directory on the CLS. Create a Symbolic Link to this program named check under your new Base Directory on the CLS so that you can easily run the program to check your work and assign your work a mark on the CLS. Note: You can create a symbolic link to this executable program but you do not have permission to read or copy the program file.

  2. Execute the above check program on the CLS using its symbolic link. (Review the Search Path notes if you forget how to run a program by pathname from the command line.) This program will check your work, assign you a mark, and display the output on your screen. (You may want to paginate the long output so you can read all of it.)

    You may run the check program as many times as you wish, to correct mistakes and get the best mark. Some task sections require you to finish the whole section before running the checking program at the end; you may not always be able to run the checking program successfully after every single task step.

  3. When you are done with checking this assignment, and you like what you see on your screen, redirect only the standard output of the Checking Program into the text file assignment04.txt under your Base Directory on the CLS. Use that exact name. Case (upper/lower case letters) matters. Be absolutely accurate, as if your marks depended on it.
    • Do not edit the output file. Submit it exactly as given.
    • Make sure the file actually contains the output of the checking program!
    • The file should contain near the bottom a line starting with: YOUR MARK for
    • Really! MAKE SURE THE FILE HAS YOUR MARKS IN IT!
  4. Transfer the above assignment04.txt file from the CLS to your local computer and verify that the file still contains all the output from the checking program. Do not edit this file! No empty files, please! Edited or damaged files will not be marked. You may want to refer to your File Transfer notes.
  5. Upload the assignment04.txt file from your local computer to the correct Assignment area on Blackboard (with the exact name) before the due date:
    1. On your local computer use a web browser to log in to Blackboard and go to the Blackboard page for this course.
    2. Go to the Blackboard Assignments area for the course, in the left side-bar menu, and find the current assignment.
    3. Under Assignments, click on the underlined assignment04 link for this assignment.
      1. If this is your first upload, the Upload Assignment page will open directly; skip the next sentence.
      2. If you have already uploaded previously, the Review Submission History page will be open and you must use the Start New button at the bottom of the page to get to the Upload Assignment page.
    4. On the Upload Assignment page, scroll down and beside Attach File use Browse My Computer to find and attach your assignment file from your local computer. Make sure the assignment file has the correct name on your local computer before you attach it.
    5. After you have attached the file on the Upload Assignment page, scroll down to the bottom of the page and use the Submit button to actually upload your attached assignment file to Blackboard.

    Use only Attach File on the Upload Assignment page. Do not enter any text into the Text Submission or Comments boxes on Blackboard; I do not read them. Use only the Attach File section followed by the Submit button. If you need to comment on any assignment submission, send me EMail.

    You can revise and upload the file more than once using the Start New button on the Review Submission History page to open a new Upload Assignment page. I only look at the most recent submission.

    You must upload the file with the correct name from your local computer; you cannot correct the name as you upload it to Blackboard.

  6. Verify that Blackboard has received your submission: After using the Submit button, you will see a page titled Review Submission History that will show all your uploaded submissions for this assignment. Each of your submissions is called an Attempt on this page. A drop-down list of all your attempts is available.
    1. Verify that your latest Attempt has the correct 16-character, lower-case file name under the SUBMISSION heading.
    2. The one file name must be the only thing under the SUBMISSION heading. Only the one file name is allowed.
    3. No COMMENTS heading should be visible on the page. Do not enter any comments when you upload an assignment.
    4. Save a screen capture of the Review Submission History page on your local computer, showing the single uploaded file name listed under SUBMISSION. If you want to claim that you uploaded the file and Blackboard lost it, you will need this screen capture to prove that you actually uploaded the file. (To date, Blackboard has never lost an uploaded file.)

    You will also see the Review Submission History page any time you already have an assignment attempt uploaded and you click on the underlined assignment04 link. You can use the Start New button on this page to re-upload your assignment as many times as you like.

    You cannot delete an assignment attempt, but you can always upload a new version. I only mark the latest version.

  7. Your instructor may also mark files in your directory in your CLS account after the due date. Leave everything there on the CLS. Do not delete any assignment work from the CLS until after the term is over!
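
The redirection and verification in the steps above can be sketched as follows. The check script created here is a hypothetical stand-in that only imitates a program writing to both standard output and standard error; the real Checking Program and its exact output are not reproduced.

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# Hypothetical stand-in for the check symbolic link.
cat > check <<'EOF'
#!/bin/sh
echo "progress message" 1>&2
echo "YOUR MARK for assignment04: (stand-in output)"
EOF
chmod +x check

# Redirect only standard output; standard error still appears on the screen.
./check > assignment04.txt

# Before transferring the file, confirm it is not empty and has the mark line.
test -s assignment04.txt || echo "output file is empty"
mark_lines=$(grep -c '^YOUR MARK for' assignment04.txt)
echo "mark lines found: $mark_lines"
```

To read long output one screen at a time before saving it, the program can instead be piped into a pager, for example ./check | less.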

READ ALL THE WORDS. OH PLEASE, PLEASE, PLEASE READ ALL THE WORDS!

Knowing regular expressions saves the day

Author: 
| Ian! D. Allen  -  idallen@idallen.ca  -  Ottawa, Ontario, Canada
| Home Page: http://idallen.com/   Contact Improv: http://contactimprov.ca/
| College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/
| Defend digital freedom:  http://eff.org/  and have fun:  http://fools.ca/
