Updated: 2016-04-22 11:06 EDT

The 100 problems of using regular expressions

1 Due Date and Deliverables

Do not print this assignment on paper!

WARNING: Some inattentive students upload Assignment #12 into the Assignment #11 upload area. Don’t make that mistake! Be exact.

2 Purpose of this Assignment

Do not print this assignment on paper! On paper, you cannot follow any of the hyperlink URLs that lead you to hints and course notes relevant to answering a question.

This assignment is based on your weekly Class Notes.

  1. Practise with regular expressions of varying complexity
  2. Create simple shell scripts

Remember to READ ALL THE WORDS to work effectively and not waste time.

3 Introduction and Overview

This is an overview of how you are expected to complete this assignment. Read all the words before you start working.

You are given a file of somewhat random text, and a set of descriptions of sets of lines in that file. For each description, you are to produce a command with a regular expression that will select the described set of lines. You will initially test your regular expressions on the interactive shell command line, and when you are satisfied with each one, you will put the command you used into a shell script.

You can use a Checking Program to check your work as you do the tasks. You can check your work with the checking program as often as you like before you submit your final mark. (Some tasks sections below require you to finish the whole section before running the checking program; you may not always be able to run the checking program successfully after every single task step.)

3.1 Save your work

You will create a file system structure in your HOME directory on the CLS, with various directories, files, and links. When you have finished the tasks, leave these files, directories, and links in place as part of your deliverables on the CLS. Do not delete any assignment work until after the term is over! Assignments may be re-marked at any time; you must have your term work available right until term end.

3.2 The Source Directory

All references to the Source Directory below are to the CLS directory ~idallen/cst8207/16w/assignment12/ and that name starts with a tilde character ~ followed by a user name with no intervening slash. The leading tilde indicates to the shell that the pathname starts with the HOME directory of the account idallen (seven letters).

You do not have permission to list the names of all the files in the Source Directory, but you can access any files whose names you already know.

3.3 Searching the course notes on the CLS

All course notes are available on the Internet and also on the CLS. You can learn about how to read and search these CLS files using the command line on the CLS under the heading Copies of the CST8207 course notes near the bottom of the page Course Linux Server.

4 Tasks

For full marks, follow these task directions below exactly as written. READ ALL THE WORDS to work effectively and not waste your time.

  1. Complete the Tasks listed below, in order, from top to bottom.
  2. Do not skip task steps. (But you can do the Labelled Descriptions in any order.)
  3. These tasks must be done in your account on the Course Linux Server.
  4. Verify your own work before running the Checking Program.
  5. Run the Checking Program to help you find errors and grade your work.
  6. Submit the grading output of the Checking Program to Blackboard before the due date.

On the due date, your instructor will also mark the work you have done in your account on the CLS. Leave all your work on the CLS and do not modify it. Do not delete any assignment work from the CLS until after the course is over.

4.1 Set Up – The Base Directory on the CLS

You must keep a list of command names used each week and write down what each command does, as described in the List of Commands You Should Know. Without that list to remind you what command names to use, you will find assignments very difficult.

  1. Do a Remote Login to the Course Linux Server (CLS) from any existing computer, using the host name appropriate for whether you are on-campus or off-campus. All work in this assignment must be done on the CLS.

  2. Base Directory: Make the CLS directory named ~/CST8207-16W/Assignments/assignment12, in which you will create the files and scripts resulting from the following tasks. (You do not have to create any directories that you have already created in a previous assignment.) Spelling and capitalization must be exactly as shown:

check

  3. Create the check symbolic link needed to run the Checking Program, as described in the section Checking Program below.

This assignment12 directory is called the Base Directory for most pathnames in this assignment. Store your files and answers in this Base Directory, not in your HOME directory or anywhere else.

Use the symbolic link to run the Checking Program to verify your work so far.

4.1.1 Checking only one of your scripts

Normally the Checking Program checks all the scripts. This can be slow if you are only interested in the check output for one script that you are working on. You can now check just one or more individual scripts by giving the script names as arguments to the checking program:

$ ./check nothing.sh                    # only check this script
$ ./check nothing.sh algid.sh           # only check these scripts

Do not submit for marking the output of checking only a few scripts!

test_input.txt

  1. The input text file test_input.txt in the Source Directory contains many lines of text. Put a soft (symbolic) link to this input file in your Base Directory. Use the same name for the link.

Use the symbolic link to run the Checking Program to verify your work so far.
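
As a minimal sketch of how a soft link behaves, the following creates a symbolic link in a scratch directory and reads a file through it. The directory and file names used here (source_dir, base_dir, sample.txt) are hypothetical stand-ins, not the real Source Directory or Base Directory paths.

```shell
# Hypothetical scratch layout; not the real CLS paths.
scratch=$(mktemp -d)
mkdir "$scratch/source_dir" "$scratch/base_dir"
printf 'one\ntwo\n' > "$scratch/source_dir/sample.txt"

# Create a symbolic link in base_dir with the same name as the target file.
ln -s "$scratch/source_dir/sample.txt" "$scratch/base_dir/sample.txt"

# Reading through the link gives the content of the target file.
link_lines=$(( $(wc -l < "$scratch/base_dir/sample.txt") ))

# test -L is true only for symbolic links.
if [ -L "$scratch/base_dir/sample.txt" ]; then is_link=yes; else is_link=no; fi
echo "lines=$link_lines symlink=$is_link"
```

The link stores only the pathname of its target, so the data stays in one place and any change to the target is visible through the link.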

4.2 Write Regular Expression Commands

You need to understand Regular Expressions to do this task.

Below, in the Labelled Descriptions section, you are given labelled descriptions of lines to find in the input text file test_input.txt. For each labelled description you will repeat these two steps (described in detail below):

  1. On the command line, invent a grep or egrep command using a single Regular Expression that will select and display only the described lines of text, and only those lines, from the input file. Do not use any options to grep (except --color=auto if you want it). You do not need multiple expressions. You do not require, but you may use if you prefer, extended Regular Expressions.
  2. Put the working grep or egrep command into its own shell script.

Each set of lines to be found is labelled below with a label. The label is the first word in the section, followed by a colon. For example, the following example description is labelled bar:

bar: lines that contain the word barbar

Repeat the following eight steps for each of the labelled descriptions:

4.2.1 Repeat these eight steps for each label

  1. Make your current working directory the Base Directory (the directory containing the new symlink you made to the test_input.txt file) if it is not already so.

  2. You must find lines in the test_input.txt file using a single grep command with a regular expression pattern. Type directly at the command line your initial attempt at a grep command that finds the lines, and view the result on your screen.

    For the example given above with the label bar, a grep command you might try to match lines containing barbar could be:

    $ grep 'barbar' test_input.txt

    The correct answer output on your screen for each problem below will vary between a few lines and a few dozen lines, depending on the problem. Look at the output you get – is it correct?

    No pipes are allowed. Use only a single grep or egrep command, imitating the above command format. No options except --color=auto are allowed.

  3. If you’re not satisfied with the output you see, use up-arrow to retrieve the previous command, make changes to the regular expression, then re-run the new command. Repeat this step on the interactive command line until you’re satisfied with the output on your screen and want to check your answer.

  4. To check your answer, use up-arrow to retrieve the command, and modify it to pipe the output of your command into the wc program, then do the same, changing wc to sum. Compare the output of wc and sum with the expected values output by the Checking Program for that question script, like this:

    $ ./check bar.sh
    [...]
    bar test_input.txt: wc should be 7 7 63 sum should be 43848

    The file bar.sh does not need to exist yet; you will always get the expected word count and sum numbers.

    For the example given above with the label bar, the checking pipelines would be done like this, in this order:

    $ grep 'barbar' test_input.txt
    $ grep 'barbar' test_input.txt | wc
    $ grep 'barbar' test_input.txt | sum

    The 'barbar' string is the quoted regular expression. Compare the numbers with the Checking Program output.

  5. If the word count or checksum values differ from those expected values output by the Checking Program, you need to fix your regular expression. Use up-arrow to retrieve the command, make your changes to the regular expression, and re-run the command until you get the same numbers.

    Do not save the output of the Checking Program; the test file may change at any time to include new test cases, so the word count and checksums may change at any time.

  6. When you are satisfied with your answer as typed on the command line, use a text editor to create in your Base Directory an executable shell script whose name is the label name followed by an .sh extension, e.g. bar.sh. Copy the working grep command from the command line into the last line of the new shell script. Only put the grep command, the regular expression, and the file name into the script, not any pipelines or checking. This executable script must run only your grep command with two arguments.

    For the example given above with the label bar, the script name must be bar.sh in the Base Directory. The command you put in the script file would be: grep 'barbar' test_input.txt

    The first three lines of every shell script must correspond exactly to the Script Header described in class.

    The last line of every script will be your grep command. Do not redirect or pipe the output of your command into anything inside the script – the script should produce the correct lines of output from test_input.txt on standard output (your screen) so that it can be checked.

    Do not put any executable lines into your script other than the Script Header and the single grep command line.

  7. You can also check the output of your script using the wc and sum commands, similar to the way you checked the original grep command. The script must output exactly the same lines as the original grep command that you put into it. The results should be identical:

    $ grep 'barbar' test_input.txt | wc
    $ ./bar.sh                     | wc
    
    $ grep 'barbar' test_input.txt | sum
    $ ./bar.sh                     | sum
  8. Add four comments to Document Your Script, then repeat these 8 steps in this section for each of the Labelled Descriptions below.
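
Steps 6 and 7 can be sketched on a scratch file as follows. This is only an illustration under assumptions: the sample data is made up, and the one-line `#!/bin/sh` header is a placeholder, not the three-line Script Header described in class.

```shell
# Work in a scratch directory with made-up data (not the real test_input.txt).
scratch=$(mktemp -d)
cd "$scratch" || exit 1
printf 'foo\nbarbar here\nbar\nxbarbarx\n' > test_input.txt

# Placeholder header only; real scripts need the Script Header from class.
printf '%s\n' '#!/bin/sh' "grep 'barbar' test_input.txt" > bar.sh
chmod u+x bar.sh

# The command line and the script should produce identical checksums.
direct_sum=$(grep 'barbar' test_input.txt | sum)
script_sum=$(./bar.sh | sum)
direct_count=$(( $(grep 'barbar' test_input.txt | wc -l) ))
echo "direct: $direct_sum  script: $script_sum  lines: $direct_count"
```

If the two checksums differ, the script is not running the same command that was tested on the command line.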

NOTE: When it comes time to create your second and subsequent scripts, copy the previous script to the new label name and update the comments rather than starting from scratch every time. Run the Checking Program to make sure you have copied the Script Header correctly.

Do not put any executable lines into your script other than the Script Header and the single grep command line.

Your scripts must give the correct output word count and checksum results when searching in this test_input.txt test file. If the output is incorrect, you will be told what the correct values should be in the error message. Do not save this message – the testing file may change at any time during the assignment and your scripts must still match the correct lines.

Write the Regular Expressions to match the given pattern specifications, not to match the particular set of lines in the given test file(s). I may come up with other test cases even after the due date of the assignment; your script loses marks if it fails these tests because it doesn’t do what the specification says it must do. You may have to write your own test cases, to be sure you got it right.
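
One way to write your own test cases, sketched here under assumptions: keep one small file of lines that should match and one of lines that should not, then confirm that grep selects every line of the first file and no line of the second. The file names and the helper function check_pattern are hypothetical, and the pattern shown is the bar example from above.

```shell
# check_pattern PATTERN SHOULD_MATCH_FILE SHOULD_NOT_MATCH_FILE
# Prints "ok" if PATTERN selects every line of the first file and
# no line of the second file; prints "fail" otherwise.
check_pattern () {
    matched=$(grep "$1" "$2")
    if [ "$matched" != "$(cat "$2")" ]; then
        echo fail; return 1
    fi
    if grep "$1" "$3" >/dev/null; then
        echo fail; return 1
    fi
    echo ok
}

scratch=$(mktemp -d)
printf 'barbar\nxxbarbarxx\n' > "$scratch/should_match.txt"
printf 'bar\nbar bar\nBARBAR\n' > "$scratch/should_not_match.txt"

result=$(check_pattern 'barbar' "$scratch/should_match.txt" "$scratch/should_not_match.txt")
echo "$result"
```

Adding a new line to either file is a cheap way to probe an edge case of a specification before the test file changes.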

I’ve also set up the checking program to detect failure to protect special characters from shell GLOB expansion. If your expression works in your account but not when the checking script runs it, this may be your problem. You may also see “Permission denied” errors if this is the problem. Fix your script to hide special characters from the shell.
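
The effect of an unprotected special character can be seen in a scratch directory. In this sketch (the file names are made up), the shell rewrites the unquoted pattern into matching file names before the command ever sees it, while the quoted pattern reaches the command unchanged.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch abc abd

# Unquoted: the shell expands a* to the matching file names.
unquoted=$(echo a*)

# Quoted: the shell passes the two characters a* through untouched.
quoted=$(echo 'a*')

echo "unquoted: $unquoted"   # prints: unquoted: abc abd
echo "quoted:   $quoted"     # prints: quoted:   a*
```

The same thing happens to a regular expression given to grep: whether it works can depend on which file names happen to exist in the current directory.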

4.3 Labelled Descriptions

Definition: Whitespace
Spaces or space-like characters such as TABs, newlines, carriage-returns, form-feeds, etc. This is a distinct POSIX character class from blanks, which are only space and TAB. This assignment uses Whitespace, not blanks.
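
The difference between the two POSIX classes can be checked directly. This sketch assumes a grep that supports POSIX bracket classes (GNU grep does); it gives grep a line whose only space-like character is a form-feed and counts the lines each class selects.

```shell
# One line containing a form-feed (\f) but no space or TAB.
sample=$(mktemp)
printf 'a\fb\n' > "$sample"

# [[:space:]] includes form-feed; [[:blank:]] is only space and TAB.
space_hits=$(( $(grep '[[:space:]]' "$sample" | wc -l) ))
blank_hits=$(( $(grep '[[:blank:]]' "$sample" | wc -l) ))
echo "space: $space_hits  blank: $blank_hits"
```

The line is selected by the Whitespace class but not by the blank class, which is why the choice of class matters for the descriptions below.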

All the points below have the following format:

Here are the names of the patterns (and scripts) you must create:

nothing.sh

  1. nothing: empty lines. (An empty line means nothing on the line, not even Whitespace characters. The line contains no characters. The start of the line and the end of the line are adjacent.)

positive.sh

  1. positive: lines containing at least two adjacent plus + characters. The two characters must be together.

backslash.sh

  1. backslash: lines containing at least one backslash \ character.

asterisk.sh

  1. asterisk: lines containing at least two adjacent asterisk * characters. The two characters must be together.

period.sh

  1. period: lines containing at least one period . character.

startstop.sh

  1. startstop: lines that start with the exact five characters begin and that end with the exact three characters end. (Any other characters might appear between the begin and the end.)

ayebee.sh

  1. ayebee: lines containing A and B, capitalized and in that order but not necessarily right next to each other. Another way of saying this is: lines containing a B following an A.

ottawa.sh

  1. ottawa: lines that contain the string Capital where the initial letter C must be upper-case but the rest of the letters could be either case, e.g. CAPITAL, CaPiTaL, etc. (You used a similar pattern searching for warez files in an earlier assignment.)

spaceline.sh

  1. spaceline: blank lines. (A blank line contains only zero or more Whitespace characters and no other kinds of characters. This pattern also matches empty lines.)

whitefirst.sh

  1. whitefirst: lines that start with the exact five characters first preceded by any amount of Whitespace. (Hint: Another way of saying this: The line starts with optional Whitespace followed by the string first.)

padfirstlast.sh

  1. padfirstlast: lines that start with the exact five characters first preceded by any amount of Whitespace and that end with the exact four characters last followed by any amount of Whitespace. (Any other characters might appear between the first and the last, but only optional Whitespace is allowed before first and after last.) (Hint: Another way of saying this: The line starts with optional Whitespace, followed by first, followed by anything, followed by last, followed by optional Whitespace, and then the end of the line.)

alphaline.sh

  1. alphaline: non-empty lines containing only alphabetic characters. (“Non-empty” means there has to be at least one alphabetic character.)

notwhite.sh

  1. notwhite: lines, possibly empty, containing no Whitespace characters. (Hint: Another way of saying this is: lines containing zero or more only “non-Whitespace” characters. This pattern also matches empty lines.)

notwhitcap.sh

  1. notwhitcap: lines containing no Whitespace or upper-case characters. (Hint: Another way of saying this is: lines containing zero or more only non-Whitespace non-uppercase characters. This pattern also matches empty lines.)

lowuplow.sh

  1. lowuplow: lines beginning with a lower-case letter and ending with a lower-case letter, with an upper-case letter anywhere in between. (Do not use hyphenated character ranges!)

sevenphone.sh

  1. sevenphone: lines that contain a seven-digit number, surrounded before and after with at least one non-digit character, with one or more underscores, dashes, or periods (only those three characters) between the third and fourth digits. These should match: x555-1212x, x555.1212x, x555_-.1212x, x555--__..-_.1212x but these would not match: x555,1212x, x555;1212x, 555555-----121212121212, x999555-1212x, x555-1212999x, x999555-1212999x, 555-121x, x55-1212, 5551212

algid.sh

  1. algid: lines containing only a single lower-case Algonquin student userid and nothing else. (Hints: Student userids are eight characters long. The first two characters are always lower-case letters. The last four characters are always digits. The middle two characters could be either. These should match: abcd0001, abc12345, ab123456 but these would not match: ABCD0001, abcd001, a1234567, abcde123. Since student userids are eight characters, this regexp must only output lines that contain exactly eight characters.)

Check your work so far using the checking program symlink.

Do not save the wc and sum output of the Checking Program; the test file may change at any time to include new test cases, so the word count and checksums may change at any time.

4.4 Counting Failed password users

You need to understand System Log Files, Redirection (pipes), Control Structures, and Regular Expressions to do this task. You need to know how to handle optional arguments using the control structures given in the previous assignment.

Scenario: Your boss is concerned that people are locking out their IP addresses because they can’t type their passwords correctly. He wants you to provide a list of users who need to be sent to remedial typing training.

badtypist.sh

Write a script named badtypist.sh (in your Base Directory) that outputs the top 30 student userids that had failed password attempts this term, sorted in descending order by the number of failed attempts.

Only output information about valid student userids, not any other userids or invalid userids.

If a single numeric argument is supplied on the command line, output that many userids instead of 30 userids.

Follow proper script-writing procedures regarding script header, argument checking, error messages, etc. I will test your script and try to destroy it with invalid input. If I succeed, zero marks. You can find descriptions of proper error messages in the previous assignment under Good Error Messages.

Write the script a little at a time and test each piece as you add it! Don’t write a huge pipeline and wonder why it doesn’t give any output. Make sure every command in the pipeline produces output for the next one.

Hints:

  1. Look in the system authorization log file for lines that look similar to this (where abcd0001 is any userid):

    Sep  1 00:00:01 idallen-ubuntu sshd[977]: Failed password for abcd0001 from 100.12.195.13 port 51512 ssh2
  2. Extract the userid column from all those lines. (You may find re-using code from your working acol script from the previous assignment useful here.) Some of the extracted lines will not contain a userid if the userid is typed incorrectly.

  3. After you have extracted the userid column, select only lines that contain a lower-case Algonquin student userid. (See the algid script you just wrote, above, for help.)

  4. Use the usual method to group together and count the most frequent occurrences of the lines and sort them in descending order of occurrence. (In the notes on Redirection, refer to counting IP addresses and pay careful attention to the need for two sort commands.)

  5. Output only the top 30 lines (or the number of lines given as the first argument to the script). (You wrote a script that used a default argument in your previous assignment. Follow that model.)
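
The counting idiom in hint 4 and the default argument in hint 5 can be sketched on made-up data. Everything here is illustrative: the sample words, the function name top_counts, and the default of 3 are assumptions, and the real script must read the log file and keep only valid student userids.

```shell
# top_counts [N]: read one item per line on standard input and print the
# N most frequent items (default 3), most frequent first.
top_counts () {
    limit=${1:-3}
    # The first sort groups identical lines so that uniq -c can count them;
    # the second sort orders the counts numerically, largest first.
    sort | uniq -c | sort -nr | head -n "$limit"
}

sample='pear
apple
pear
fig
apple
pear
kiwi'

top=$(printf '%s\n' "$sample" | top_counts 2)
echo "$top"
```

Without the first sort, uniq -c would only count adjacent duplicates, so the same item could appear several times with partial counts.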

Make sure all the examples below work before you run the checking program! Example:

$ ./badtypist.sh 2 
52 abcd0001
42 abcd0002

$ ./badtypist.sh | wc -l
30

$ ./badtypist.sh 50 | wc -l
50

$ ./badtypist.sh a b c d
...print error and usage messages; see previous assignment...

Add comments to Document Your Script.

Bonus Points: You have the technology available to verify that the first argument to the script is a valid number (i.e. contains only digits). Can you do this?

$ ./badtypist.sh notnumber
...print error and usage messages; see previous assignment...

You can detect non-digits in an argument in several ways. Pick one and use it to prevent non-numbers from being used in the script.

Some ideas for checking for a non-digit:

Hint1: The shell case statement can apply a GLOB pattern match against any text string, e.g. text in an argument. Apply a match for a non-digit and exit the script with an error message if found.

Hint2: The grep family of programs return success if a pattern matches and failure if it does not. You can send text (e.g. an argument) into the standard input of these programs.
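
Both hints can be sketched as small functions. The function names are hypothetical, and the -q option used here to silence grep is an assumption that applies only to this argument check, not to the regular-expression scripts above, where options are not allowed.

```shell
# Hint1 style: a GLOB pattern in a case statement.
# Succeeds only if the argument is non-empty and contains no non-digit.
is_number_case () {
    case "$1" in
        '' | *[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hint2 style: send the argument to grep on standard input and
# use the exit status of grep.
is_number_grep () {
    printf '%s\n' "$1" | grep -q '^[0-9][0-9]*$'
}

for arg in 50 notnumber 12x ''; do
    if is_number_case "$arg" && is_number_grep "$arg"; then
        echo "'$arg' is a number"
    else
        echo "'$arg' is not a number"
    fi
done
```

Either function alone is enough; a script would call one of them on its first argument and print its error and usage messages when the check fails.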

4.5 When you are done

That is all the tasks you need to do.

Read your CLS Linux EMail and remove any messages that may be waiting. See Reading EMail for help.

Check your work a final time using the Checking Program below and save the standard output of that program into a file as described below. Submit that file (and only that one file) to Blackboard following the directions below.

When you are done, log out of the CLS before you close your laptop or close the PuTTY window, by using the shell exit command:

$ exit

5 Document Your Script

You must document your script with four comment lines before you submit it. Add four comment lines to each script containing the following four types of information, in the following order:

  1. The assignment number and name (copied exactly from the top of the assignment page).
  2. The script name, e.g. badtypist.sh
  3. Your name, your 9-digit student number, and your Algonquin email address.
  4. The one-line Signing Key for this script file, generated by running the checking program with a first argument of -s and a second argument of the script name, e.g. ./check -s badtypist.sh The Signing Key comment line must start with # $Id: and have $ at the end of the line. The Signing Key is about 60 characters long.

Obey these rules for your script comments:

  1. The block of four comment lines must appear below the standard script header and above your actual script code.
  2. A blank line must separate the block of comment lines from the script header above it and another blank line must separate the block of comments from the script code below it.

Here is a sample four-line comment block for a hypothetical assignment number 99:

# Assignment 99 This is a Sample Comment Block
# foo.sh
# Ian Allen 123456789 abcd0001@algonquinlive.com
# $Id:==wMwATMgI2NxIDO0N3Ygg2cuMHduVWb1dmchByN4YTOxcTO1QTM$

Note the correct placement of the comment block in the script file, as described above!

6 Checking, Marking, and Submitting your Work

Summary: Do some tasks, then run the Checking Program to verify your work as you go. You can run the Checking Program as often as you want. When you have the best mark, upload the single file that is the output of the Checking Program to Blackboard.

Since I also do manual marking of student assignments, your final mark may not be the same as the mark submitted using the current version of the Checking Program. I do not guarantee that any version of the Checking Program will find all the errors in your work. Complete your assignments according to the specifications, not according to the incomplete set of the mistakes detected by the Checking Program.

check

  1. There is a Checking Program named assignment12check in the Source Directory on the CLS. You can execute this program by typing its (long) pathname into the shell as a command name and paginating the (often long) output using less:

    $ ~idallen/cst8207/16w/assignment12/assignment12check | less

    Create a symbolic link named check in your Base Directory that links to the Checking Program in the Source Directory, as you did in a previous assignment. Use the symlink to check your work:

    $ ./check | less

Checking only one of your scripts

Normally the Checking Program checks all the scripts. This can be slow if you are only interested in the check output for one script that you are working on. You can now check just one or more individual scripts by giving the script names as arguments to the checking program:

$ ./check nothing.sh                    # only check this script
$ ./check nothing.sh algid.sh           # only check these scripts

Do not submit for marking the output of checking only a few scripts!

  1. When you are done, execute the above Checking Program as a command line on the CLS. This program will check your work, assign you a mark, and display the output on your screen.

    You may run the Checking Program as many times as you wish, allowing you to correct mistakes and get the best mark. Some task sections require you to finish the whole section before running the Checking Program at the end; you may not always be able to run the Checking Program successfully after every single task step.

  2. When you are done with this assignment, and you like the mark displayed on your screen by the Checking Program, you must redirect only the standard output of the Checking Program into the text file assignment12.txt in your Base Directory on the CLS, like this:

    $ ./check >assignment12.txt
    $ less assignment12.txt
    • Use standard output redirection with that exact assignment12.txt file name.
    • Use that exact name. Case (upper/lower case letters) matters.
    • Be absolutely accurate, as if your marks depended on it.
    • Do not edit the output file; the format is fixed.
    • Make sure the file actually contains the output of the Checking Program!
    • The file should contain, near the bottom, a line starting with: YOUR MARK for
    • Really! MAKE SURE THE FILE HAS YOUR MARKS IN IT!
  3. Transfer the above single file assignment12.txt (containing the output from the Checking Program) from the CLS to your local computer.
    • You may want to refer to the File Transfer page for how to transfer the file.
    • Verify that the file still contains all the output from the Checking Program.
    • Do not edit or open and save this file on your local computer! Edited or damaged files will not be marked. Submit the file exactly as given.
    • The file should contain, near the bottom, a line starting with: YOUR MARK for
    • Really! MAKE SURE THE FILE YOU UPLOAD HAS YOUR MARKS IN IT!
  4. Upload the assignment12.txt file from your local computer to the correct Assignment area on Blackboard (with the exact name) before the due date:
    1. On your local computer use a web browser to log in to Blackboard and go to the Blackboard page for this course.
    2. Go to the Blackboard Assignments area for the course, in the left side-bar menu, and find the current assignment.
    3. Under Assignments, click on the underlined assignment12 link for this assignment.
      1. If this is your first upload, the Upload Assignment page will open directly; skip the next sentence.
      2. If you have already uploaded previously, the Review Submission History page will be open and you must use the Start New button at the bottom of the page to get to the Upload Assignment page.
    4. On the Upload Assignment page, scroll down and beside Attach File use Browse My Computer to find and attach your assignment12.txt file from your local computer. Make sure the assignment file has the correct name on your local computer before you attach it. Attach only your assignment12.txt file for upload. Do not attach any other file names.
    5. After you have attached the assignment12.txt file on the Upload Assignment page, scroll down to the bottom of the page and use the Submit button to actually upload your attached assignment12.txt file to Blackboard.
    6. Submit the file exactly as uploaded from the CLS.
    7. Do not submit an empty file. Do not submit any other file names.

    Use only Attach File, Browse My Computer on the Upload Assignment page. Do not enter any text into the Write Submission or Add Comments boxes on Blackboard; I do not read them. Use only the Attach File, Browse My Computer section followed by the Submit button. If you need to comment on any assignment submission, send me EMail.

    You can revise and upload the file more than once using the Start New button on the Review Submission History page to open a new Upload Assignment page. I only look at the most recent submission.

    You must upload the file with the correct name from your local computer; you cannot correct the name as you upload it to Blackboard.

  5. Verify that Blackboard has received your submission: After using the Submit button, you will see a page titled Review Submission History that will show all your uploaded submissions for this assignment. Each of your submissions is called an Attempt on this page. A drop-down list of all your attempts is available.
    1. Verify that your latest Attempt has the correct 16-character, lower-case file name under the SUBMISSION heading.
    2. The one file name must be the only thing under the SUBMISSION heading. Only the one file name is allowed.
    3. No COMMENTS heading should be visible on the page. Do not enter any comments when you upload an assignment.
    4. Click on the Download button to open and view the file you just uploaded. MAKE SURE THE FILE YOU JUST UPLOADED HAS YOUR MARKS IN IT!
    5. Save a screen capture of the Review Submission History page on your local computer, showing the single uploaded file name listed under SUBMISSION. If you want to claim that you uploaded the file and Blackboard lost it, you will need this screen capture to prove that you actually uploaded the file. (To date, Blackboard has never lost an uploaded file.)
    6. Make sure you have used Submit and not Save as Draft. I cannot mark draft assignments. Make sure you Submit.

    You will also see the Review Submission History page any time you already have an assignment attempt uploaded and you click on the underlined assignment12 link. You can use the Start New button on this page to re-upload your assignment as many times as you like.

    You cannot delete an assignment attempt, but you can always upload a new version. I only mark the latest version.

  6. Your instructor may also mark files in your directory in your CLS account after the due date. Leave everything there on the CLS. Do not delete any assignment work from the CLS until after the term is over!

READ ALL THE WORDS. OH PLEASE, PLEASE, PLEASE READ ALL THE WORDS!

Author: 
| Ian! D. Allen  -  idallen@idallen.ca  -  Ottawa, Ontario, Canada
| Home Page: http://idallen.com/   Contact Improv: http://contactimprov.ca/
| College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/
| Defend digital freedom:  http://eff.org/  and have fun:  http://fools.ca/
