RegExp Skills

Regular Expression Skill Assessment

Here are some descriptions of text manipulation problems of varying levels of difficulty. The skills to do these problems come from Appendix A in your Unix text. These are all examples of text and data manipulation. Some problems may be solved using Unix utilities that don't use regular expressions. Many of the problems require more than one Unix utility, or the same utility used repeatedly.

To succeed in becoming a Web Programmer, you must be able to do all the Elementary manipulations given here. You must be able to do most of the Basic manipulations.

Elementary

Change the letters "dog" to "HORSE" everywhere it occurs on all lines.
Change all occurrences of the letters "Man" at the beginning of a line to "Person".
Change all occurrences of "stick" followed by any punctuation at the end of a line to "Stick.". (The punctuation is replaced by a period.)
Change all occurrences of "Dog" or "dog" to "COW".
Change all Canadian or American spellings of colour (color) to "Color".
Double all vowels in every word on every line.
Triple the amount of space between every word.
Find and print lines that contain "dog" followed by any number of digits then "cat".
Find and print lines that contain the letters "dog" followed anywhere by the letters "cat".
Change all occurrences of one or more digits to the single word "NUMBER".
Replace all occurrences of one or more blanks with a single blank.
Replace all occurrences of one or more tabs or blanks with a single blank.
Remove the first 8 characters from every line.
Remove all leading blanks or tabs from all lines.
Remove all trailing blanks or tabs from all lines.
Replace all tab characters with eight spaces.
Change all punctuation so that the sentence period lies outside of the closing double quote, e.g. the text "Hello there." (period before the closing quote) becomes "Hello there". (period after it).
Remove everything leading up to and including the last blank on each line.
Remove everything including and after the first blank on each line.
Put double quotes around every occurrence of the phrase "user-friendly".
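
A few of the Elementary items can be sketched as one-line filters. The function names below are hypothetical, each one reads standard input and writes standard output, and the bracket classes assume a POSIX sed and grep. Treat these as one possible approach, not as the expected answers.

```shell
# Hypothetical helper names; each one is a filter (stdin to stdout).

# Change "dog" to "HORSE" everywhere on every line.
dog_to_horse() { sed 's/dog/HORSE/g'; }

# Change "Man" to "Person" only at the beginning of a line.
man_to_person() { sed 's/^Man/Person/'; }

# Change "colour" or "color" to "Color"; \{0,1\} makes the "u" optional.
fix_colour() { sed 's/[Cc]olou\{0,1\}r/Color/g'; }

# Print lines containing "dog", any number of digits, then "cat".
dog_digits_cat() { grep 'dog[0-9]*cat'; }

# Replace each run of blanks or tabs with a single blank.
squeeze_blanks() { sed 's/[[:blank:]][[:blank:]]*/ /g'; }

# Remove the first 8 characters of every line; no regular expression needed.
drop_first_8() { cut -c9-; }
```

For example, `printf 'hotdog dog\n' | dog_to_horse` prints `hotHORSE HORSE`.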

Basic

Add an extra blank after every period at the end of a sentence.
Make sure that every period at the end of a sentence is followed by exactly two blanks.
Truncate every line to ten characters.
Exchange the first 10 characters with the next 15 characters on every line.
Exchange the first number with the second number on every line.
Remove all leading zeroes from the first number on each line. Don't mishandle single digit zeroes.
Find and print lines that contain all the vowels in alphabetical order, a before e before i before o before u. Test using /usr/dict/words.
Find and print lines that contain all the vowels in any order. Test using /usr/dict/words.
Change all occurrences of one or more digits surrounded by spaces to the word "NUMBER" also surrounded by spaces.
Change only the second occurrence of a single blank to a colon in each line.
Change only the second-to-last single blank to a colon in each line.
Change only the second occurrence of a string of one or more blanks to a colon in each line.
Change only the second-to-last occurrence of a string of one or more blanks to a colon in each line.
Remove all occurrences of HTML tags whose open and closing angle brackets are on the same line (e.g. <BR>, <TABLE>, <A HREF="...">, etc.). Remove all of them, not just the first ones.
Remove everything on every line that appears between double quotes, leaving only the quotes. (Example: a "bcd" efg "h i" j --> a "" efg "" j ) Handle empty strings (adjacent quotes) correctly.
Find lines that contain only one single quote character (an unmatched quote).
Put double quotes around every occurrence of the phrase "user-friendly", unless the phrase already has double quotes around it.
Find all numbers prefixed by a dollar sign, remove the dollar sign, and suffix the number with "CDN", e.g. $123.45 becomes 123.45CDN. Now do the reverse.
Find all numbers with periods separating decimals and change the periods to commas, e.g. 123.45 becomes 123,45. Now do the reverse.
Find all numbers with commas separating sets of three digits and change all the commas to spaces, e.g. 1,234,567.23 becomes 1 234 567.23. (You may assume the only use of a comma immediately followed by three digits is as a separator.)
Locate common misspellings and mistypings of "@algonquincollege.com" and fix them all. (e.g. fix algonqinc.ont.can, etc.)
Find all occurrences of your name with or without initials and embedded spaces. (e.g. "Ian D. Allen", "Ian Allen", "I. D. Allen", "ID Allen", "IDAllen", "iallen", etc.) Try to minimize false hits in the middle of words. (e.g. fallen, challenge, Wallenstein, etc.)
Remove either single or double quotes from around all strings of one or more digits, e.g. "10" or '10' become just 10. Now do the reverse (add quotes to all numbers).
Locate hexadecimal numbers having the form "0xA0FF2375C3" and prefix them with the string "(HEX:)", e.g. 0xDEAD would appear as (HEX:)0xDEAD and 0xBEAD00BEAD00 would appear as (HEX:)0xBEAD00BEAD00. Now do the reverse (remove the prefixes).
Use a single regular expression to change every occurrence of the word "dog" to be "dog-eat-dog" and "cat" to be "cat-eat-cat". Now do the reverse.
Produce a plain list of mail addresses and home pages for everyone with an account on this system.
Write a script that will perform a simple substitution on the contents of each of the files given on the command line, e.g. $ ./script 's/dog/cat/g' *.txt
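
Some of the Basic items can be sketched the same way. The function names are hypothetical and the patterns are only one way to read each problem; the vowel filters assume lowercase input such as a word list.

```shell
# Hypothetical helper names; each one is a filter (stdin to stdout).

# Exchange the first 10 characters with the next 15 on every line.
swap_10_15() { sed 's/^\(.\{10\}\)\(.\{15\}\)/\2\1/'; }

# Remove every HTML tag that opens and closes on the same line.
strip_tags() { sed 's/<[^>]*>//g'; }

# Empty every double-quoted string, keeping the quotes themselves.
empty_quotes() { sed 's/"[^"]*"/""/g'; }

# Print lines with the vowels in order: a before e before i before o before u.
vowels_in_order() { grep 'a.*e.*i.*o.*u'; }

# Print lines containing all five vowels in any order, by chaining grep.
all_vowels() { grep a | grep e | grep i | grep o | grep u; }
```

The last function is an instance of "the same utility used repeatedly": each grep keeps only the lines that also contain the next vowel.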

Advanced

Have every new sentence in a document start at the beginning of a line. (Insert newline characters at the end of every sentence.)
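
A rough sketch of this one, assuming a sentence ends with a period, exclamation mark, or question mark followed by one or more blanks. Abbreviations such as "e.g. " will be split too, so this is only a starting point. The function name is hypothetical.

```shell
# Hypothetical filter: break the line after each sentence-ending mark.
# The backslash followed by a real newline is the portable way to put a
# newline into a sed replacement.
split_sentences() {
    sed 's/\([.!?]\)  */\1\
/g'
}
```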

Find and print lines where all vowels are in strict alphabetical order, i.e. no e precedes an a, no i precedes an e, no o precedes an i, etc. All vowels that appear are in alphabetical order in the input, from left to right. Test your expression on /usr/dict/words.
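
One way to think about this problem: the line must consist of a stretch in which "a" is the only vowel allowed, then a stretch in which only "e" is allowed, and so on. A sketch under that reading, assuming lowercase input and using a hypothetical function name:

```shell
# Hypothetical filter: keep lines whose vowels never go backwards.
# Each bracket expression excludes every vowel except the one whose turn it is.
vowels_strictly_ordered() {
    grep '^[^eiou]*[^aiou]*[^aeou]*[^aeiu]*[^aeio]*$'
}
```

Lines with no vowels at all also pass, since every stretch may be empty.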

Change the second and all subsequent occurrences of one or more blanks to single blanks. (The first occurrence of a string of blanks is untouched.)

A file has a large number of columns of numbers separated by blanks. Change every second string of blanks to a colon. (A line of output might appear thus: 12 34:56 78:90 12) You don't know how many columns are in the input files.
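
A possible sketch, assuming lines do not begin with blanks: match two columns at a time together with the blanks that follow them, and let the g flag repeat that for as many columns as there are. The function name is hypothetical.

```shell
# Hypothetical filter: turn every second run of blanks into a colon.
# \1 is "column, blanks, column"; the blanks after it become ":".
colon_every_second_gap() {
    sed 's/\([^ ][^ ]*  *[^ ][^ ]*\)  */\1:/g'
}
```

With the input `12 34 56 78 90 12` this prints `12 34:56 78:90 12`.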

Exchange the first number with the last number on every line.
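
A sketch that treats a "number" as an unsigned run of digits (an assumption; signs and decimal points are not handled). The middle group is greedy and must end in a non-digit, which is what pushes the third group out to the last number on the line. The function name is hypothetical.

```shell
# Hypothetical filter: swap the first and last digit runs on each line.
# Lines with fewer than two numbers do not match and pass through unchanged.
swap_first_last_number() {
    sed 's/\([0-9][0-9]*\)\(.*[^0-9]\)\([0-9][0-9]*\)/\3\2\1/'
}
```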

Remove all leading zeroes from all numbers on each line. Don't mishandle a single digit zero.
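
A sketch for whole numbers only (an assumption: a decimal such as 1.005 would be mangled). Always keeping one digit after the stripped zeroes is what protects a lone 0. The function name is hypothetical.

```shell
# Hypothetical filter: strip leading zeroes from every digit run.
# The first expression handles a number at the start of the line; the second
# handles numbers that follow a non-digit character.
strip_leading_zeroes() {
    sed -e 's/^0*\([0-9]\)/\1/' \
        -e 's/\([^0-9]\)0*\([0-9]\)/\1\2/g'
}
```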

Produce an HTML table of active links (the links are clickable) to mail addresses and home pages for everyone with an account on this system. Include the full names of the people with the accounts.

Turn any text file into an approximation of Pig Latin. (For examples of Pig Latin, see: Club Girl Tech's Pig Latin Translator and Pig Latin Page or Pig Latin Converter) [I don't know if regular expressions can do this with 100% accuracy; but, even 90% will be amusing to read.] See also: C Language Pig Latin program source

Write a script that will rename files according to a sed substitute pattern given as the first argument to the script, for example:

$ ./rename 's/txt$/dat/' *.txt

would take all the file names in the current directory that end in ".txt" and rename them to end in ".dat". What pattern would you use to rename files with names such as "file1.day.mon" to "file1.mon.day", e.g. "foo.31.01" would become "foo.01.31" and "bar.30.12" would become "bar.12.30", etc.?
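
A minimal sketch of such a script, written here as a shell function with a hypothetical name so it can be tried interactively. It assumes file names contain no newlines, and it skips a rename instead of overwriting an existing file. The variable at the end holds one candidate answer to the day/month question.

```shell
# Hypothetical function form of ./rename: rename_files SED_EXPRESSION FILE...
rename_files() {
    script=$1
    shift
    for old in "$@"; do
        # Run the file name, not the file contents, through sed.
        new=$(printf '%s\n' "$old" | sed "$script")
        if [ "$new" = "$old" ]; then
            continue        # the pattern did not change this name
        fi
        if [ -e "$new" ]; then
            echo "rename: $new already exists, skipping $old" >&2
            continue
        fi
        mv -- "$old" "$new"
    done
}

# One candidate pattern for swapping the last two dot-separated fields.
swap_last_two='s/\.\([^.]*\)\.\([^.]*\)$/.\2.\1/'
```

Under these assumptions, `rename_files "$swap_last_two" foo.31.01` would rename the file to foo.01.31.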

Write a script that will convert alphabetic dated file names to numeric names, e.g. a file named "Mar.12.99" would be renamed "1999-03-12" and "Jul.31.54" would become "1954-07-31". Make sure your script doesn't overwrite any existing files. Does your script handle all the possible forms of each month name, e.g. "Mar", "MAR", "mar", "March", "MARCH", "march"? (Hint: You can use multiple "-e" options to sed.) (p.s. Does your script handle the year 2000?)

Advanced script/regexp problem: Write a script that will generate and execute a sed expression that will do a substitution on the Nth occurrence of a string. The N is given as the first argument. The string and replacement are given as the next two arguments. The script will process lines from standard input. For example:

       $ echo aaaaaaaaaa | ./script 8 'a' 'b'
       aaaaaaabaa

You can make some simplifying assumptions to make it easier:

	the string and replacement won't contain slashes
	the string and replacement won't contain any blanks or other shell metacharacters or special characters
	the string and replacement won't contain any regular expression characters, e.g. . [ ] *
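
Under these three assumptions, a minimal sketch can build the sed expression by plain string interpolation. It relies on the numeric flag of the sed s command, which substitutes only the Nth match on each line. The function name is hypothetical and stands in for ./script.

```shell
# Hypothetical function form of the script: nth_subst N STRING REPLACEMENT
# Assumes the restrictions listed above, so the arguments can be pasted into
# the sed expression without any escaping.
nth_subst() {
    n=$1
    string=$2
    replacement=$3
    # Generates, for example, s/a/b/8 and runs it on standard input.
    sed "s/$string/$replacement/$n"
}
```

With this definition, `echo aaaaaaaaaa | nth_subst 8 a b` prints `aaaaaaabaa`.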

Super advanced problem: Solve the same problem, but remove the simplifying restrictions. Removing all the restrictions is hard. You may have to extensively pre-process the string and the replacement to protect embedded special characters. Removing some of the restrictions (e.g. blanks) is not too hard, but you may need to know more advanced shell features. Hint: See the "eval" built-in shell command. ("man sh", "man bash")
```
        sh$ x="'nested quoted string'"
        sh$ ./argprint $x
        ['nested]
        [quoted]
        [string']
        sh$ eval ./argprint $x
        [nested quoted string]

        sh$ y="This is y"
        sh$ x='$y'
        sh$ echo $x
        $y
        sh$ eval echo $x
        This is y
```
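
The transcript above uses ./argprint, which is not shown. A minimal stand-in, assuming all it does is print each argument on its own line inside square brackets, could be a shell function:

```shell
# Hypothetical stand-in for ./argprint: print each argument in brackets.
argprint() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

x="'nested quoted string'"
argprint $x        # unquoted: the shell splits $x into three words
eval argprint $x   # eval re-parses the line, so the inner quotes group the words
```

The difference between the two calls is the point of the hint: eval makes the shell parse the expanded text a second time, so quotes stored inside a variable take effect.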

Additional Resources

See the entries for Regular Expressions in the FastTrack Resources page.