=======================================================
Regular Expressions - practice examples with commentary
=======================================================
-Ian! D. Allen idallen@idallen.ca

Here are examples of regular expressions ("REs" or "regexp").  Many of
these are "idioms" - things that people who use regexp a lot know and
recognize.

These examples use the Quigley Chapter 3 "datebook" file.  (Do not use
the Chapter 4 or Chapter 5 "datebook" files - they have been corrupted to
have DOS line ends and regexp will not properly match things at the ends
of lines in those files due to the extra carriage-return characters.)

The "datebook" file contains lines with fields delimited by colon
characters (':').

* Find addresses containing the letter 'z'.
  (Addresses are the third colon-delimited field in each line.)

    grep '^[^:]*:[^:]*:[^:]*z' datebook

  The idiom '[^X]*X', where 'X' is some delimiter, matches one field in
  a line together with the delimiter that ends it.  In English, the idiom
  '[^X]*X' means:

     [^X]*X -> Match zero or more non-X characters, followed by X.
  
  You can repeat the idiom for multiple field matches.  The match works
  fine even for empty fields, since we match zero or more characters in
  each field.

  How might you match a letter 'z' in the 200th field in a line, without
  writing the above idiom out 199 times?  (Hint: Extended regexps have
  a syntax for grouping parts of REs and repeating a previous regexp,
  such as a group, a fixed number of times.  See the "Groups characters"
  example on p.96 and Example 4.3.3 on p.97.)
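
  As a rough sketch of the hint above, here is the same third-field search
  written with an extended-regexp group and a repeat count instead of
  writing the idiom out twice.  The sample lines are invented for this
  illustration (they only imitate the datebook layout), and the variable
  names are hypothetical.  One line has an empty second field, to show
  that empty fields still count:

```shell
# Invented sample lines in a datebook-like layout (not the real file).
sample='Liz Poe:555-1111:12 Oak St, Troy, NY:1/2/55:30000
Bob Ray:555-2222:9 Hazel Rd, Mesa, AZ:3/4/66:40000
Cy Dee::7 Ezra Ave, Lima, OH:5/6/77:50000'

# ([^:]*:){2} skips exactly two fields (empty or not); then [^:]*z
# looks for a 'z' inside the third field only.
third_field_z=$(printf '%s\n' "$sample" | grep -E '^([^:]*:){2}[^:]*z')
printf '%s\n' "$third_field_z"
```

  The 'z' in "Liz" is in the first field, so that line is not selected.
  For the 200th field, the repeat count would presumably become {199}.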

* Delete addresses containing the letter 'z'.
  (Addresses are the third colon-delimited field in each line.)

    sed -e 's/^\([^:]*:[^:]*:\)[^:]*z[^:]*/\1/' datebook

  This uses the same idiom as above, to match the first two fields in the
  line, followed by a third field that contains a letter 'z' anywhere
  in the field.  We use parentheses, \( and \), around the part of the match
  that we want to keep - that part (the first parenthesized expression)
  is substituted back into each line using the '\1' construct in the
  right-hand side of the substitution.  The part of the match outside
  the parentheses (the third field containing the letter 'z') is not
  substituted back into the line, effectively deleting it from the line.

  The left-hand side only matches lines that have a 'z' in the third
  field, so nothing happens at all on lines that don't match.  Those lines
  pass through sed unchanged.
  
  The default for sed is to output every line that it reads, even if no
  changes have been made.  To turn off the default behaviour and output
  only lines that have been changed, see the "-n" option to sed and the
  "p" option to the "s" command:

     sed -n -e 's/foo/bar/p'      # only outputs the line if changed
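
  To see the "-n" option and the "p" flag working together with the
  address-deleting substitution, here is a small sketch.  The two sample
  lines are invented for this illustration and only imitate the datebook
  layout; the variable names are hypothetical:

```shell
# Invented sample lines in a datebook-like layout (not the real file).
sample='Liz Poe:555-1111:12 Oak St, Troy, NY:1/2/55:30000
Bob Ray:555-2222:9 Hazel Rd, Mesa, AZ:3/4/66:40000'

# -n turns off the default output; the trailing 'p' prints a line only
# when the substitution actually changed it.
changed=$(printf '%s\n' "$sample" |
    sed -n -e 's/^\([^:]*:[^:]*:\)[^:]*z[^:]*/\1/p')
printf '%s\n' "$changed"
```

  Only the second line has a 'z' in its third field, so only that line
  is printed, with its address field emptied:
  Bob Ray:555-2222::3/4/66:40000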

* Add "19" in front of the birth year for all lines.
  (Birth dates are the second-to-last colon-delimited field in each line.)

    sed -e 's/\([0-9][0-9]:[^:]*\)$/19\1/' datebook

  The birth year is the last two digits in the second-to-last field.
  We use a variant on the field idiom 'X[^X]*' where 'X' is some
  delimiter.  In English, the idiom 'X[^X]*' means:

     X[^X]* -> Match X followed by zero or more non-X characters.

  If anchored to the end of the line (':[^:]*$'), this matches the
  colon and the contents of the last field in the line.  To match
  earlier fields, prepend more of the same idiom.  In our case, we
  want to match the last two digits at the end of the second-to-last
  field (two digits in front of the colon), so we use:  [0-9][0-9]:[^:]*$

  We surround the entire regexp with parentheses, so that the whole match
  is re-inserted into the line using the \1 syntax on the right-hand side.
  In front of the whole thing, we put the "19" we need, so that the 19
  is substituted to appear just before the two digits in the year.
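
  A quick way to check this substitution without the datebook file is
  to feed sed a single line by hand.  The line below is invented for
  this illustration and only imitates the datebook layout; the variable
  names are hypothetical:

```shell
# One invented line in a datebook-like layout (not the real file).
line='Bob Ray:555-2222:9 Hazel Rd, Mesa, AZ:3/4/66:40000'

# The regexp can only match at the end of the line: two digits, the
# last colon, then the last field.  "19" goes in front of that match.
with_century=$(printf '%s\n' "$line" |
    sed -e 's/\([0-9][0-9]:[^:]*\)$/19\1/')
printf '%s\n' "$with_century"
```

  The digit pairs earlier in the line (in the phone number, for
  instance) are left alone, because nothing after them can reach the
  end of the line without crossing another colon.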

* Add "19" in front of the birth year for all lines starting with 'Fred'.
  (Birth dates are the second-to-last colon-delimited field in each line.)

    sed -e 's/^\(Fred.*\)\([0-9][0-9]:[^:]*\)$/\119\2/' datebook

  We add a parenthesized regexp that matches 'Fred' at the start of the
  line.  We also match everything ('.*') after Fred up to the regexp
  we already developed that matches the birth year.  If this entire
  regexp succeeds in matching, \1 will contain all the characters from
  'Fred' up to the birth year, and \2 will contain all the characters
  from the birth year to the end of the line.  Between these two
  matches, we want to insert the digits '19':  \119\2
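
  The following sketch tries the 'Fred' version on two lines, one that
  starts with 'Fred' and one that does not.  Both lines are invented for
  this illustration and only imitate the datebook layout; the variable
  names are hypothetical:

```shell
# Invented sample lines in a datebook-like layout (not the real file).
sample='Fred Fig:555-4444:3 Pine Ct, Rome, GA:7/8/49:25000
Bob Ray:555-2222:9 Hazel Rd, Mesa, AZ:3/4/66:40000'

# \1 is everything from 'Fred' up to the year, \2 is the year and the
# rest of the line; "19" is inserted between them.
fred_only=$(printf '%s\n' "$sample" |
    sed -e 's/^\(Fred.*\)\([0-9][0-9]:[^:]*\)$/\119\2/')
printf '%s\n' "$fred_only"
```

  The 'Fred' line should come out with the year 1949, and the other
  line should pass through unchanged.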

* Find lines longer than 78 characters.

    Example 1:  grep '.\{78\}.' datebook

  English: find 78 characters followed by one more character.
  
  You could also do the math in your head and write this solution:

    Example 2:  grep '.\{79\}' datebook

  but then the regexp doesn't contain the original '78' in it.  Someone
  could do a global substitution changing '78' to '55' in your script,
  and it would fail to find and change the '79' to '56'.  Also, if we
  wanted to make the number an argument from the command line, the first
  way works directly:

    Example 1B:  grep '.\{'"$1"'\}.' datebook

  It's messy using the second example, since we have to add one:

    Example 2B (messy):

    num=$1
    num=$((num + 1))
    grep '.\{'"$num"'\}' datebook

  Less code is better code!
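
  As a sketch of how Example 1B might be packaged for reuse, here is the
  same regexp inside a small shell function that reads standard input.
  The function name "longer_than" is made up for this illustration, and
  the function assumes its argument is a non-negative whole number:

```shell
# longer_than N: print the input lines that have more than N characters.
# (Hypothetical helper; the name is not from the original examples.)
longer_than() {
    grep '.\{'"$1"'\}.'
}

# Try it with a small limit so the effect is easy to see.
long_lines=$(printf '%s\n' 'abc' 'abcd' 'abcde' | longer_than 3)
printf '%s\n' "$long_lines"
```

  With a limit of 3, the three-character line is dropped and the two
  longer lines are printed.  To check the datebook file itself, the call
  would be something like:  longer_than 78 <datebook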