Regular Expressions

•   Our standard script header
•   Matching patterns
•   POSIX character classes
•   Regular Expressions
•   Character classes
•   Some Regular Expression gotchas
•   Regular Expression Resources
•   Assignment 3 on Regular Expressions

#!/bin/sh -u
PATH=/bin:/usr/bin ; export PATH   # add /sbin and /usr/sbin if needed
umask 022                          # use 077 for secure scripts

• There are two different pattern matching
  facilities that we use in Unix/Linux:
1. filename globbing patterns match existing
    pathnames in the current filesystem only
2. regular expressions match substrings in
    arbitrary input text

•   We need to pay close attention to which of
    the two situations we're in, because some of
    the same special characters have different
    meanings!

•   Globbing is used for
    ◦ globbing patterns in command lines
    ◦ patterns used with the find command
•   shell command line (the shell will match the
    patterns against the file system):
    ◦ ls *.txt
    ◦ echo ?????.txt
    ◦ vi [ab]*.txt
•   find command (we double quote the pattern so
    the find command sees the pattern, not the shell):
    ◦ find ~ -name "*.txt"
    ◦ in this case, the find command matches the pattern against
      the file system
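
The two globbing situations above can be compared side by side. This is a minimal sketch, assuming mktemp -d is available (it is common but not required by POSIX); the file names are invented for the demonstration.

```sh
# Work in a throwaway directory so the patterns only see files we create.
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/a.txt" "$demo/notes.txt" "$demo/sub/deep.txt" "$demo/b.log"
cd "$demo" || exit 1

# Unquoted: the shell expands *.txt before echo runs,
# and only looks in the current directory.
echo *.txt                      # a.txt notes.txt

# Quoted: the pattern reaches find intact, and find matches it
# against names in every directory it visits.
find . -name "*.txt" | sort     # ./a.txt ./notes.txt ./sub/deep.txt
```

In the second command the shell never touches the pattern, which is why sub/deep.txt is found as well.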

•   IMPORTANT: regular expressions use some of
    the same special characters as filename
    matching on the previous slide but they mean
    different things!
•   Regular expressions can be used in awk,
    grep, vi, sed, more, less, and many
    email server applications.

•   Before we look at regular expressions, let's
    take a look at some expressions you're
    already comfortable with: algebraic
    expressions
•   Larger algebraic expressions are formed by
    putting smaller expressions together

Expression   Meaning             Comment
a            a                   a simple expression
b            b                   another simple expression
ab           a x b               ab is a larger expression formed from
                                 two smaller ones; concatenating two
                                 expressions means multiplying them
b2           b x b               we might have written this as b^2,
                                 using ^ as an exponentiation operator
ab2          a x (b x b)         why not (a x b) x (a x b)?
(ab)2        (a x b) x (a x b)

•   [:alnum:] alphanumeric characters
•   [:alpha:] alphabetic characters
•   [:blank:] space, tab
•   [:cntrl:] control characters
•   [:digit:] digit characters
•   [:lower:] lower case alphabetic characters
•   [:print:] visible characters, plus the space character
•   [:punct:] punctuation characters and other symbols
    ◦ !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
•   [:space:] white space (space, tab, newline, carriage
    return, form feed, vertical tab)
•   [:upper:] upper case alphabetic characters
•   [:xdigit:] hexadecimal digits
•   [:graph:] visible characters (anything except spaces
    and control characters)

Expression   Meaning                  Comment
a            match a single 'a'       a simple expression
b            match a single 'b'       another simple expression
ab           match a single 'a'       "ab" is a larger expression formed
             followed by a single     from two smaller ones; concatenating
             'b'                      two regular expressions means
                                      "followed immediately by", which
                                      we'll shorten to "followed by"
b*           match zero or more       a big difference in meaning from the
             'b' characters           '*' in globbing! This is the regular
                                      expression repetition operator.
ab*          'a' followed by zero     why not ('a' followed by 'b') repeated
             or more 'b'              zero or more times? Hint: think of
             characters               "ab2" in algebra.
\(ab\)*      ('a' followed by 'b'),   we can use parentheses, but in Basic
             zero or more times       Regular Expressions we write \( and \)
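
The difference between ab* and \(ab\)* can be checked with grep. This is a small sketch; the sample() helper and its three lines are invented for the demonstration.

```sh
# Three sample lines fed to grep on standard input.
sample() { printf '%s\n' 'abbb' 'abab' 'xyz'; }

# ab* : 'a' followed by zero or more 'b', so a line needs at least an 'a'.
sample | grep 'ab*'           # abbb, abab

# \(ab\)* : zero or more copies of "ab". Zero copies is allowed,
# so every line matches, including xyz.
sample | grep '\(ab\)*'       # abbb, abab, xyz

# Anchoring both ends shows the grouping: the whole line must be
# made of repeated "ab" pairs.
sample | grep '^\(ab\)*$'     # abab
```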

Expression      Matches             Ex.     Example Matches     Comment
non-special     itself              x       "x"                 like globbing
character
one expression  first followed      xy      "xy"                like globbing
followed by     by second
another
.               any single          .       "x" or "y" or "!"   like the '?' in globbing
                character                   or "." or "*"
                                            …etc
expression      zero or more        x*      "" or "x" or "xx"   NOT like the * in globbing,
followed by *   matches of the              or "xxx" …etc       although .* behaves like
                expression                                      * in globbing
character       a SINGLE character  [abc]   "a" or "b" or "c"   like globbing
classes         from the list

Expression      Matches             Ex.     Example Matches     Comment
^               beginning of a      ^x      "x" if it's the     anchors the match to the
                line of text                first character     beginning of a line
                                            on the line
$               end of a line of    x$      "x" if it's the     anchors the match to the
                text                        last character      end of a line
                                            on the line
^ (but not      ^                   a^b     "a^b"               ^ has no special meaning
first)                                                          unless it is first
$ (but not      $                   a$b     "a$b"               $ has no special meaning
last)                                                           unless it is last
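
The anchor rules can be tried on a few short lines. This is a minimal sketch; the lines() helper and its contents are made up for the test.

```sh
# Four test lines: x first, x last, and ^ and $ in the middle of a line.
lines() { printf '%s\n' 'xa' 'ax' 'a^b' 'a$b'; }

lines | grep '^x'     # xa   (x must be the first character)
lines | grep 'x$'     # ax   (x must be the last character)
lines | grep 'a^b'    # a^b  (^ is not first, so it is an ordinary character)
lines | grep 'a$b'    # a$b  ($ is not last, so it is an ordinary character)
```

The single quotes matter in the last command: without them the shell would try to expand $b.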

Expression      Matches             Ex.     Example Matches     Comment
special         as if the           [\]     "\"                 conditions: ']' must be
character       character is not                                first, '^' must not be
inside [ and ]  special                                         first, and '-' must be
                                                                first or last
\ followed by   that character      \.      "."                 like globbing
a special       with its special
character       meaning removed
\ followed by   the non-special     \a      "a"                 \ before a non-special
a non-special   character                                       character is usually
character                                                       ignored, but do not rely
                                                                on it: some sequences,
                                                                such as \( and \), have
                                                                a special meaning
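
Both ways of removing a character's special meaning (a backslash, or square brackets) can be compared directly. This is a small sketch with invented test lines.

```sh
# Four lines that differ only in the middle character.
lines() { printf '%s\n' 'a.b' 'axb' 'a*b' 'a\b'; }

lines | grep -c 'a.b'    # 4: an unescaped . matches any single character
lines | grep 'a\.b'      # a.b  (backslash removes the special meaning)
lines | grep 'a[.]b'     # a.b  (so do square brackets)
lines | grep 'a[*]b'     # a*b
lines | grep 'a[\]b'     # a\b  (a backslash is ordinary inside brackets)
```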

•   testing regular expressions with grep on stdin
    ◦ run grep --color=auto 'expr'
    ◦ use single quotes to protect your expr from the
      shell
    ◦ grep will wait for you to repeatedly enter your test
      strings (type ^D to finish)
    ◦ grep will print any string that matches your expr,
      so each matched string will appear twice (once
      when you type it, and once when grep prints it)
    ◦ the part of the string that matched will be colored
    ◦ unmatched strings will appear only once where you
      typed them
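
When the same test strings are needed repeatedly, they can be piped to grep instead of typed. This is one possible sketch; try_regex is a hypothetical helper name, not a standard command.

```sh
# try_regex EXPR STRING... : print each STRING that contains a match for EXPR.
try_regex() {
    regex=$1
    shift
    # One test string per line, then let grep filter them.
    printf '%s\n' "$@" | grep -e "$regex"
}

try_regex 'ab*c' 'ac' 'abbbc' 'abd'    # prints ac and abbbc
```

Only matching strings are printed, so a string that is missing from the output did not match.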

•   For now, we'll use grep on the command line
•   We will get into the habit of putting our regex
    in single quotes on the command line to
    protect the regex from the shell
•   Special characters for basic regular
    expressions: \, [, ], ., *, ^, $
•   We can match a single quote by using double
    quotes, as in: grep "I said, \"don't\""
•   alternatively: grep 'I said, "don'\''t"'
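
Both quoting styles hand grep the same expression, which can be confirmed on one test line. This is a minimal sketch; the variable name line is arbitrary.

```sh
# The test line contains both a double quote and a single quote.
line='I said, "don'\''t"'

# Double quotes outside, with the inner double quotes escaped.
printf '%s\n' "$line" | grep "I said, \"don't\""

# Single quotes outside; '\'' closes the quote, adds a literal ', and reopens it.
printf '%s\n' "$line" | grep 'I said, "don'\''t"'
```

Each command prints the line once: I said, "don't"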

•   Appendix A in the Sobell textbook is a
    source of information
•   You can read under REGULAR EXPRESSIONS
    in the man page for the grep command - this
    tells you what you need to know
•   The grep man page is normally available on
    Unix systems, so you can use it to refresh
    your memory, even years from now

•   examples (try these)
    ◦   grep 'ab'       # any string with a followed by b
    ◦   grep 'aa*b'     # one or more a followed by b
    ◦   grep 'a..*b'    # a, then one or more of anything, then b
    ◦   grep 'a.*b'     # a, then zero or more of anything, then b
    ◦   grep 'a.b'      # a, then exactly one of anything, then b
    ◦   grep '^a'       # a must be the first character
    ◦   grep '^a.*b$'   # a must be first, b must be last
•   Try other examples: have fun!
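
Some of these expressions can be run against a fixed word list, typed with plain single quotes. This is a small sketch; the words() helper and its six words are invented for the comparison.

```sh
# Six short test words, one per line.
words() { printf '%s\n' 'ab' 'aab' 'axb' 'axxb' 'ba' 'cab'; }

words | grep 'a..*b'     # aab axb axxb         (at least one character between a and b)
words | grep 'a.*b'      # ab aab axb axxb cab  (anything, or nothing, between a and b)
words | grep 'a.b'       # aab axb              (exactly one character between a and b)
words | grep '^a.*b$'    # ab aab axb axxb      (cab fails: a is not first)
```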

•   Character classes are lists of characters inside
    square brackets
•   They work the same in regex as they do in
    globbing
•   Character class expressions always match
    EXACTLY ONE character (unless they are
    repeated by appending '*')
•   [azh] matches "a" or "h" or "z"

•   Non-special characters inside the square
    brackets form a set (order doesn't matter,
    and repeats don’t affect the meaning):
    ◦ [azh] and [zha] and [aazh] are all equivalent
•   Special characters lose their meaning when
    inside square brackets, but watch out for ^,
    ], and - which do have special meaning
    inside square brackets, depending on where
    they occur

•   ^ inside square brackets makes the character
    class expression mean "any single character
    UNLESS it's one of these"
•   [^azh] means "any single character that is
    NOT a, z, or h"
•   ^ has its special "inside square brackets"
    meaning only if it is the first character inside
    the square brackets
•   [a^zh] means a, h, z, or ^
•   Remember, leading ^ outside of square
    brackets has special meaning "match
    beginning of line"
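
The effect of ^ inside square brackets depends on its position, which a few one-character lines can show. This is a minimal sketch; the chars() helper is invented for the test.

```sh
# One character per line, including a literal ^.
chars() { printf '%s\n' 'a' 'z' 'h' 'q' '^'; }

chars | grep '[azh]'     # a z h
chars | grep '[zha]'     # a z h    (order inside the brackets does not matter)
chars | grep '[^azh]'    # q ^      (leading ^ negates the class)
chars | grep '[a^zh]'    # a z h ^  (^ is not first, so it is just a member)
```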

•   ] can be placed inside square brackets but it
    has to be first (or second if ^ is first)
•   []azh] means ], a, h, or z
•   [^]azh] means "any single character that is
    NOT ], a, h, or z"
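•   Putting ] inside square brackets in any other
    position does not do what you want:
    ◦ [ab]d] is a failed attempt at a class containing
      a, b, ], and d: it means [ab] followed by d]
    ◦ [] is a failed attempt at []]: it is an
      unterminated bracket expression (a syntax error)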
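
The two placements of ] that do work can be tried on a few lines. This is a small sketch with invented test input.

```sh
# Three one-character lines, including a literal ].
chars() { printf '%s\n' ']' 'a' 'q'; }

chars | grep '[]azh]'     # ] a   (] is first, so it is a member of the class)
chars | grep '[^]azh]'    # q     (] is second, right after the negating ^)
```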

•   - inside square brackets represents a range
    of characters, unless it is first or last
•   [az-] means a, z, or -
•   [a-z] means any one character between a
    and z inclusive (but what does that mean?)
•   "Between a and z inclusive" used to mean
    something, because there was only one locale
•   Now that there is more than one locale, the
    meaning of "between a and z inclusive" is
    ambiguous because it means different things
    in different locales

•   i18n basically means "support for more than one locale"
•   Not all computer users use the same alphabet
•   When we write a shell script, we want it to handle text and filenames
    properly for the user, no matter what language they use
•   In the beginning, there was ASCII, a 7 bit code of 128 characters
•   Now there’s Unicode, a table that is meant to assign an integer to
    every character in the world
•   UTF-8 is an encoding of that table, storing the 7-bit ASCII
    characters in a single byte with high order bit of 0
•   The 128 single-byte UTF-8 characters are the same as true ASCII
    bytes (both have a high order bit of 0)
•   UTF-8 characters that are not ASCII occupy more than one byte, and
    these give us our accented characters, non-Latin characters, etc
•   Locale settings determine how characters are interpreted and
    treated, whether as ASCII or UTF-8, their ordering, and so on

•   A locale is the definition of the subset of a user's environment that
    depends on language and cultural conventions.
•   For example, in a French locale, some accented characters qualify as
    "lower case alphabetic", but in the old "C" locale, ASCII a-z contains
    no accented characters.
•   Locale is made up from one or more categories. Each category is
    identified by its name and controls specific aspects of the behavior
    of components of the system.
•   Category names correspond to the following environment variable
    names (the first three especially can affect the behavior of our shell
    scripts):
    ◦   LC_ALL: Overrides any individual setting of the below categories.
    ◦   LC_CTYPE: Character classification and case conversion.
    ◦   LC_COLLATE: Collation order.
    ◦   LC_MONETARY: Monetary formatting.
    ◦   LC_NUMERIC: Numeric, non-monetary formatting.
    ◦   LC_TIME: Date and time formats.
    ◦   LC_MESSAGES: Formats of informative and diagnostic messages and interactive
        responses.

$ export LC_ALL=C
$ echo *
A B C Z a b c z
$ echo [a-z]*
a b c z
$ export LC_ALL=en_CA.UTF-8
$ echo *
A a B b C c Z z
$ echo [a-z]*
a B b C c Z z
$
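
The same question can be asked of grep. This sketch pins the locale to C for a single command, where a range follows ASCII order; the letters() helper is invented. In other locales the range may give a different result, and running locale -a lists the locales installed on a given system.

```sh
# Upper and lower case letters, one per line.
letters() { printf '%s\n' A a B b Z z; }

# LC_ALL=C placed before a command applies to that command only.
letters | LC_ALL=C grep '[a-z]'          # a b z
letters | LC_ALL=C grep '[[:lower:]]'    # a b z
```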

•   Do not use ranges in bracket expressions
•   We now use special symbols to represent the
    sets of characters that we used to represent
    with ranges.
•   These all start with [: and end with :]
•   For example lower case alphabetic characters
    are represented by the symbol [:lower:]
    ◦ [[:lower:]] matches any lower case alpha char
    ◦ [AZ[:lower:]12] matches A, Z, 1, 2, or any
      lower case alpha char

•   [:alnum:] alphanumeric characters
•   [:alpha:] alphabetic characters
•   [:blank:] space, tab
•   [:cntrl:] control characters
•   [:digit:] digit characters
•   [:lower:] lower case alphabetic characters
•   [:print:] visible characters, plus the space character
•   [:punct:] punctuation characters and other symbols
    ◦ !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
•   [:space:] white space (space, tab, newline, carriage
    return, form feed, vertical tab)
•   [:upper:] upper case alphabetic characters
•   [:xdigit:] hexadecimal digits
•   [:graph:] visible characters (anything except spaces
    and control characters)

•   POSIX character classes go inside […]
•   examples
    ◦ [[:alnum:]] matches any alphanumeric character
    ◦ [[:alnum:]}] matches one alphanumeric or }
    ◦ [[:alpha:][:cntrl:]] matches one alphabetic or
      control character
•   Take NOTE!
    ◦ [:alnum:] matches one of a,:,l,n,u,m (but grep on
      the CLS will give an error by default)
    ◦ [abc[:digit:]] matches one of a,b,c, or a digit

•   The exact content of each character class
    depends on the local language.
•   Only for plain ASCII is it true that "letters"
    means English a-z and A-Z.
•   Other languages have other "letters", e.g. é, ç,
    etc.
•   When we use the POSIX character classes, we
    are specifying the correct set of characters for
    the local language as per the POSIX
    description

•   Remember any match will be as long as
    possible
    ◦ aa* matches the aaa in xaaax just once, even
      though you might think there are three smaller
      matches in a row
•   Unix/Linux regex processing is line based
    ◦ our input strings are processed line by line
    ◦ newlines are not considered part of our input string
    ◦ we have ^ and $ to control matching relative to
      newlines
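
One way to see the longest match is to let sed mark it; sed uses the same basic regular expressions as grep. This is a small sketch in which & in the replacement stands for the matched text.

```sh
# Wrap the first match of aa* in square brackets.
echo 'xaaax' | sed 's/aa*/[&]/'      # x[aaa]x

# Even with g (replace every match) there is only one match: all three a's.
echo 'xaaax' | sed 's/aa*/[&]/g'     # x[aaa]x
```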

•   expressions that match zero length strings
    ◦ remember that the repetition operator * means
      "zero or more"
    ◦ any expression consisting of zero or more of
      anything can also match zero
    ◦ For example, x*, meaning "zero or more x
      characters", will match ANY line, up to n+1 times,
      where n is the number of (non-x) characters on that
      line, because there are zero x characters before and
      after every non-x character
    ◦ grep and regexpal.com cannot highlight matches
      of zero characters, but the matches are there!
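
The zero-length matches can be made visible with sed, which is able to replace them. This is a sketch assuming GNU sed; other sed implementations are expected to behave the same way here, but that is an assumption.

```sh
# Every line matches x*, even an empty line and lines without any x.
printf '%s\n' 'abc' '' 'xyz' | grep -c 'x*'    # 3

# Replace each (empty) match of x* with a dash: abc has n = 3 non-x
# characters, so there are n + 1 = 4 matches.
echo 'abc' | sed 's/x*/-/g'                    # -a-b-c-
```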

•   quoting (don't let the shell change regex
    before grep sees the regex)
$ mkdir empty
$ cd empty
$ grep [[:upper:]] /etc/passwd | wc
    503    2009   39530
$ touch Z
$ grep [[:upper:]] /etc/passwd | wc
      7      29     562
$ touch A
$ grep [[:upper:]] /etc/passwd | wc
     87     343    7841
$ chmod 000 Z
$ grep [[:upper:]] /etc/passwd | wc
grep: Z: Permission denied
     87     343    7841

•   To explain the previous slide, use echo to
    print out the grep command you are actually
    running:

$ echo grep [[:upper:]] /etc/passwd
grep A Z /etc/passwd

$ rm ?

$ echo grep [[:upper:]] /etc/passwd
grep [[:upper:]] /etc/passwd
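
The same echo check shows that quoting removes the problem. This is a minimal sketch in a scratch directory, assuming mktemp -d is available.

```sh
# A scratch directory holding two files with one-letter upper case names.
demo=$(mktemp -d)
cd "$demo" || exit 1
touch A Z

# Unquoted: the shell replaces the pattern with the matching file names.
echo grep [[:upper:]] /etc/passwd      # grep A Z /etc/passwd

# Quoted: the pattern reaches grep unchanged, whatever files exist.
echo grep '[[:upper:]]' /etc/passwd    # grep [[:upper:]] /etc/passwd
```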

•   we will not use range expressions
•   we'll standardize on en_CA.UTF-8 so that the
    checking script for assignments always sees
    things formatted the same way

•   http://www.regular-expressions.info/tutorial.html
•   http://lynda.com
•   http://regexpal.com
•   http://teaching.idallen.com/cst8177/14w/notes/000_character_sets.html
•   http://www.regular-expressions.info/posixbrackets.html

•   Some students are already comfortable with
    the command line
•   For those who aren't, yet another tutorial
    source that might help is Lynda.com
•   All Algonquin students have free access to
    Lynda.com
•   Unix for Mac OS X users:
http://www.lynda.com/Mac-OS-X-10-6-tutorials/Unix-for-Mac-OS-X-Users/78546-2.html

•   Lynda.com has a course on regular expressions
•   The problem is that it covers our material as well as some
    more advanced topics that we won't cover
•   It is a good presentation, and the following chapters should
    have minimal references to the "too advanced" material
    ◦ Chapter 2 Characters
    ◦ Chapter 3 Character Sets
    ◦ Chapter 4 Repetition Expressions
•   On campus use this URL:
http://www.lynda.com/Regular-Expressions-tutorials/Using-Regular-Expressions/85870-2.html
•   Off campus use this URL:
http://wwwlyndacom.rap.ocls.ca/Regular-Expressions-tutorials/Using-Regular-Expressions/85870-2.html

•   Assignment 3 asks you to write shell scripts
•   These are simple scripts: just the script header,
    and a grep command where coming up with the
    regex is your work to be done
•   You don't need extended regular expression
    functionality, and the checking script will disallow
    it
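
The shape of such a script might look like the sketch below. The task (print input lines that begin with a digit) and the name digits.sh are invented for illustration and are not taken from the assignment; mktemp -d is assumed to be available.

```sh
# Write a hypothetical script into a scratch directory.
dir=$(mktemp -d)
cat > "$dir/digits.sh" <<'EOF'
#!/bin/sh -u
PATH=/bin:/usr/bin ; export PATH   # add /sbin and /usr/sbin if needed
umask 022                          # use 077 for secure scripts

# Print the lines of standard input that begin with a digit
# (basic regular expression only).
grep '^[[:digit:]]'
EOF

# Feed three test lines to the script.
printf '%s\n' '1 one' 'two' '3 three' | sh -u "$dir/digits.sh"    # 1 one, 3 three
```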
