=======================================================
Regular Expressions - practice examples with commentary
=======================================================
-Ian! D. Allen  idallen@idallen.ca

Here are examples of regular expressions ("REs" or "regexp").  Many of
these are "idioms" - things that people who use regexp a lot know and
recognize.

These examples use the Quigley Chapter 3 "datebook" file.  (Do not use
the Chapter 4 or Chapter 5 "datebook" files - they have been corrupted
to have DOS line ends, and regexp will not properly match things at the
ends of lines in those files due to the extra carriage-return
characters.)

The "datebook" file contains lines with fields delimited by colon
characters (':').

* Find addresses containing the letter 'z'.
  (Addresses are the third colon-delimited field in each line.)

    grep '^[^:]*:[^:]*:[^:]*z' datebook

    The idiom '[^X]*X', where 'X' is some delimiter, matches a field in
    a line ending with that delimiter.  In English, the idiom '[^X]*X'
    means:

        [^X]*X  -> Match zero or more non-X characters, followed by X.

    You can repeat the idiom for multiple field matches.  The match
    works fine even for empty fields, since we match zero or more
    characters in each field.

    How might you match a letter 'z' in the 200th field in a line,
    without writing the above idiom out 199 times?  (Hint: Extended
    regexps have a syntax for grouping parts of REs and repeating a
    previous regexp, such as a group, a fixed number of times.  See the
    "Groups characters" example on p.96 and Example 4.3.3 on p.97.)

* Delete addresses containing the letter 'z'.
  (Addresses are the third colon-delimited field in each line.)

    sed -e 's/^\([^:]*:[^:]*:\)[^:]*z[^:]*/\1/' datebook

    This uses the same idiom as above to match the first two fields in
    the line, followed by a third field that contains a letter 'z'
    anywhere in the field.
    We use parentheses \( and \) around the part of the match that we
    want to keep - that part (the first parenthesized expression) is
    substituted back into each line using the '\1' construct in the
    right-hand side of the substitution.  The part of the match outside
    the parentheses (the third field containing the letter 'z') is not
    substituted back into the line, effectively deleting it from the
    line.

    The left-hand side only matches lines that have a 'z' in the third
    field, so nothing happens at all on lines that don't match.  Those
    lines pass through sed unchanged.

    The default for sed is to output every line that it reads, even if
    no changes have been made.  To turn off the default behaviour and
    output only lines that have been changed, see the "-n" option to
    sed and the "p" flag to the "s" command:

        sed -n -e 's/foo/bar/p'    # only outputs the line if changed

* Add "19" in front of the birth year for all lines.
  (Birth dates are the second-to-last colon-delimited field in each line.)

    sed -e 's/\([0-9][0-9]:[^:]*\)$/19\1/' datebook

    The birth year is the last two digits in the second-to-last field.
    We use a variant on the field idiom, 'X[^X]*', where 'X' is some
    delimiter.  In English, the idiom 'X[^X]*' means:

        X[^X]*  -> Match X followed by zero or more non-X characters.

    If anchored to the end of the line (':[^:]*$'), this matches the
    colon and the contents of the last field in the line.  To match
    earlier fields, prepend more of the same idiom.

    In our case, we want to match the last two digits at the end of the
    second-to-last field (two digits in front of the colon), so we use:

        [0-9][0-9]:[^:]*$

    We surround the entire regexp (except the '$' anchor) with
    parentheses, so that the whole match is re-inserted into the line
    using the \1 syntax on the right-hand side.  In front of the whole
    thing, we put the "19" we need, so that the 19 is substituted to
    appear just before the two digits in the year.

* Add "19" in front of the birth year for all lines starting with 'Fred'.
  (Birth dates are the second-to-last colon-delimited field in each line.)

    sed -e 's/^\(Fred.*\)\([0-9][0-9]:[^:]*\)$/\119\2/' datebook

    We add a parenthesized regexp that matches 'Fred' at the start of
    the line.  We also match everything ('.*') after Fred up to the
    regexp we already developed that matches the birth year.

    If this entire regexp succeeds in matching, \1 will contain all the
    characters from 'Fred' up to (but not including) the two-digit birth
    year, and \2 will contain all the characters from the two-digit
    birth year to the end of the line.  Between these two matches, we
    want to insert the digits '19':

        \119\2

* Find lines longer than 78 characters.

    Example 1:
        grep '.\{78\}.' datebook

    English: find 78 characters followed by one more character.

    You could also do the math in your head and write this solution:

    Example 2:
        grep '.\{79\}' datebook

    but then the regexp doesn't contain the original '78' in it.
    Someone could do a global substitution changing '78' to '55' in
    your script, and it would fail to find and change the '79' to '56'.

    Also, if we wanted to make the number an argument from the command
    line, the first way works directly:

    Example 1B:
        grep '.\{'"$1"'\}.' datebook

    It's messy using the second example, since we have to add one:

    Example 2B (messy):
        num=$1
        let num=num+1
        grep '.\{'"$num"'\}' datebook

    Less code is better code!
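
    As a rough sketch only, Example 1B could be wrapped in a small shell
    function so the number and the file name both come from arguments.
    The function name "longer_than" and the argument check are
    assumptions added here for illustration; they are not part of the
    original examples.

```shell
# Hypothetical helper (not from the original notes): print the lines of
# a file that are longer than N characters, using the Example 1B idiom.
longer_than () {
    # $1 = N, the maximum allowed line length; $2 = file to search
    case "$1" in
        ''|*[!0-9]*)
            # Reject a missing or non-numeric N before it reaches grep.
            echo "usage: longer_than N file" >&2
            return 2
            ;;
    esac
    # N characters followed by one more character: the line is longer than N.
    grep '.\{'"$1"'\}.' "$2"
}
```

    With this defined, "longer_than 78 datebook" should behave like
    Example 1.  The number still appears unchanged inside the regexp,
    so no arithmetic is needed.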