% Regular Expressions -- matching patterns and replacing text
% Ian! D. Allen -- <idallen@idallen.ca> -- [www.idallen.com]
% Winter 2016 - January to April 2016 - Updated 2016-10-28 17:25 EDT

Matching Patterns: GLOB vs. Regular Expressions
===============================================

There are two different pattern matching facilities that we use in
Unix/Linux: **GLOB patterns** and **[Regular Expressions]**.

Regular Expressions are another way to match patterns in text, similar to but
more powerful than simple GLOB patterns.

Pay close attention to which of the two situations you're in, because some of
the same special characters common to GLOB and Regular Expressions have
different meanings!

GLOB patterns (review)
----------------------

There are several major places where GLOB patterns are used:

### File GLOB in the Shell: `*.txt`

In the shell, GLOB patterns may be used to match existing pathnames in the
file system:

    $ ls *.txt
    $ echo ?????.txt
    $ touch [ab]*.txt

The shell tries to expand the GLOB to match existing pathnames before the
associated command runs.
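
The expansion step can be seen in a small sketch. The scratch directory and the file names below are hypothetical, made up only for this demonstration; the behaviour for a pattern that matches nothing is the usual `sh`/`bash` default and can be changed by shell options.

```shell
# Hypothetical scratch directory and file names, used only for this demonstration.
dir=$(mktemp -d)
touch "$dir/apple.txt" "$dir/berry.txt" "$dir/notes.md"

# The shell replaces *.txt with the matching names before echo ever runs.
matched=$(cd "$dir" && echo *.txt)

# Nothing matches *.pdf, so (by default) the pattern is passed along unchanged.
unmatched=$(cd "$dir" && echo *.pdf)

echo "matched:   $matched"
echo "unmatched: $unmatched"
rm -r "$dir"
```

The `echo` command never sees the `*`; it only sees the list of names that the shell built from it.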

### `case` statement GLOB in the Shell

GLOB patterns are used in shell `case` statements to match the text at the
top of the `case` statement:

    case "$1" in
    /* ) type='Absolute Pathname' ;;
    *  ) type='Relative Pathname' ;;
    esac

### GLOB in the `find` command: `-name '*.txt'`

The `find` command `-name` operator also matches GLOB patterns against the
file system, but it does so recursively in every directory, not just in one
directory:

    $ find . -name '*.txt'
    $ find . -name '?????.txt'
    $ find . -name '[ab]*.txt'

We quote the patterns above to hide them from the shell, so that the `find`
command receives each pattern intact and the shell doesn't try to expand it.
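
A minimal sketch of the recursive matching, again using a hypothetical scratch directory and made-up file names:

```shell
# Hypothetical directory tree, used only for this demonstration.
dir=$(mktemp -d)
mkdir "$dir/sub"
touch "$dir/top.txt" "$dir/sub/deep.txt" "$dir/sub/other.md"

# The quotes deliver the pattern to find unexpanded; find then applies it
# to the names in every directory below the starting point.
found=$(cd "$dir" && find . -name '*.txt' | sort)

echo "$found"
rm -r "$dir"
```

Both `top.txt` and `sub/deep.txt` are found, even though only one of them is in the starting directory.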

Regular Expressions -- Basic and Extended
-----------------------------------------

**[Regular Expressions]** (short form: *regexp*) are text matching patterns
similar to GLOB patterns but more powerful. Regexp patterns use all the GLOB
pattern matching characters and add more. The characters work slightly
differently between GLOB and regexp.

Regexp are used by many Unix/Linux programs and programming languages such as
`grep`, `sed`, `awk`, `vim`, `less`, `more`, `man`, `Perl`, `python`, etc.

In an editor (such as `vim` or `sed`), a Regular Expression may be used to
select characters to be deleted, replaced, or exchanged:

    :%s/colou*r/COLOUR/g                   # vim replacement regular expression

    $ echo "Colouur bad.  Colour red.  Color tan." | sed -e 's/Colou*r/COLOUR/g'
    COLOUR bad.  COLOUR red.  COLOUR tan.

Regexp have a **Basic** set of pattern matching characters and an
**Extended** set of characters. The `grep` program family is a very popular
user of both **Basic** and **Extended** Regular Expressions.

The `grep` command itself accepts **Basic** Regular Expression syntax, and
needs backslashes in front of some operators to access **Extended** Regular
Expression features. The `egrep` command accepts **Extended** Regular
Expression syntax and does not need the backslashes. You can do the same text
search using either command, but the syntax changes:

    $ grep 'publickey for \(idallen\|cst8207[abc]\?\)' /var/log/auth.log   # Basic
    $ egrep 'publickey for (idallen|cst8207[abc]?)' /var/log/auth.log      # Extended

From the section `REGULAR EXPRESSIONS` in the man page for the `grep`
command:

    Basic vs Extended Regular Expressions
      In basic regular expressions the meta-characters ?, +, {, |,
      (, and ) lose their special meaning; instead use the backslashed
      versions \?, \+, \{, \|, \(, and \).

Even the `bash` shell has extended syntax that allows the use of regular
expressions instead of simple GLOB patterns.
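
As a hedged illustration of that `bash` syntax: the sketch below assumes a `bash` binary is on the PATH and uses the `[[ ... =~ ... ]]` test, which takes an Extended Regular Expression on the right-hand side. The user name is a made-up sample value.

```shell
# Assumes bash is installed; [[ ... =~ ... ]] is bash syntax, not POSIX sh,
# so the test is run inside an explicit bash process.
result=$(bash -c '
    user="cst8207a"
    if [[ "$user" =~ ^cst8207[abc]?$ ]]; then
        echo "regexp match"
    else
        echo "no match"
    fi
')
echo "$result"
```

The regexp is left unquoted on purpose; quoting the right-hand side makes `bash` treat it as literal text.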

> **IMPORTANT:** Regular Expressions use some of the same special characters
> as GLOB patterns, but they mean different things! In particular, `*`, `?`,
> and `.` work differently! There are others!

GLOB patterns are *anchored*; Regular Expressions *float*
---------------------------------------------------------

GLOB patterns are said to be **anchored** to the start and end of the line;
they must always match the entire text string (usually a file name) from the
start to the end.

The GLOB pattern `a*b` matches only text that starts with `a` and ends with
`b` -- that GLOB pattern doesn't match just the `ab` in the middle of
`xxxabxxx`.

The modified GLOB pattern `*a*b*` now matches the whole text that *contains*
`a` followed by `b` anywhere in the text. The modified GLOB pattern *does*
match the entire text `xxxabxxx`.

Regular Expressions are *not* anchored by default. They "float" down the text
and they may match *anywhere* in the text string unless you explicitly anchor
them to either the start or end of the text using the regexp characters `^`
and/or `$`.

The Regular Expression `a.*b` matches inside any text that *contains* `a`
followed by `b` anywhere in the text. The floating regexp *does* match the
`ab` in the middle of `xxxabxxx`.

The modified Regular Expression `^a.*b$` is now **anchored** to the start and
end of the text. The modified expression now matches exactly the same text as
the GLOB pattern `a*b` because it forces the `a` to match at the start and
the `b` to match at the end. It does *not* match inside `xxxabxxx`.

You must remember to anchor the ends of your Regular Expressions if you want
to be sure that they match the *whole* piece of text and not just some part
of the text.

Summary:

-   Unanchored regexp `a.*b` matches (only) the text `ab` inside `xxxabxxx`.
-   Anchored regexp `^a.*b$` does not match the `ab` inside `xxxabxxx` because
    the `a` has to be at the start and the `b` has to be at the end. It does
    match the string `aaaaab`.
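
The difference can be checked with a short sketch that tests the same string three ways: a GLOB in a `case` statement, a floating regexp, and an anchored regexp.

```shell
text='xxxabxxx'

# GLOB in a case statement: the pattern must cover the whole string.
case "$text" in
a*b ) glob='match' ;;
*   ) glob='no match' ;;
esac

# grep -c prints the number of matching lines; "|| true" ignores grep's
# non-zero exit status when that number is zero.
floating=$(printf '%s\n' "$text" | grep -c 'a.*b' || true)
anchored=$(printf '%s\n' "$text" | grep -c '^a.*b$' || true)

echo "GLOB a*b: $glob; regexp a.*b: $floating; regexp ^a.*b\$: $anchored"
```

Only the floating regexp finds anything in `xxxabxxx`; the GLOB and the anchored regexp both insist on `a` first and `b` last.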

Regular Expressions compared with Algebraic Expressions
=======================================================

Like algebraic expressions, more complex Regular Expressions are built up by
combining simpler expressions. Regular Expressions have operators similar to
algebraic operators, but they mean different things than in algebra. Like
algebraic operators, Regular Expression operators have bindings and
precedence when combined with other operators.

Before we look at Regular Expressions, let's take a look at some Algebraic
Expressions you're already comfortable with. Larger Algebraic Expressions are
formed by putting smaller expressions together:

::: allbox
+------------------------+------------------------+------------------------+
| Expression             | Meaning                | Comment                |
+========================+========================+========================+
| a                      | a                      | a simple expression    |
+------------------------+------------------------+------------------------+
| b                      | b                      | another simple         |
|                        |                        | expression             |
+------------------------+------------------------+------------------------+
| ab                     | a x b                  | ab is a larger         |
|                        |                        | expression formed from |
|                        |                        | two smaller ones       |
|                        |                        |                        |
|                        |                        | concatenating two      |
|                        |                        | expressions together   |
|                        |                        | means to multiply them |
+------------------------+------------------------+------------------------+
| b^2^                   | b x b                  | we might have          |
|                        |                        | represented this with  |
|                        |                        | b\^2, using \^ as an   |
|                        |                        | exponentiation         |
|                        |                        | operator               |
+------------------------+------------------------+------------------------+
| ab^2^                  | a x (b x b)            | not (a x b) x (a x b)  |
+------------------------+------------------------+------------------------+
| (ab)^2^                | (a x b) x (a x b)      | parentheses for        |
|                        |                        | grouping               |
+------------------------+------------------------+------------------------+

: Algebraic Expressions
:::

Basic Regular Expressions using `*` repetition (zero or more) and parentheses
-----------------------------------------------------------------------------

Similar to an algebraic exponent, the asterisk/star `*` Regular Expression
operator binds tightly to the immediately preceding Regular Expression and
repeats it zero or more times. Parentheses (a feature of Extended Regular
Expressions) can be used for grouping, e.g.

    $ grep 'suc*eed' document.txt        # find sueed, suceed, succeed, succceed, etc.
    $ grep 'Bar\(bar\)*a' document.txt   # find Bara, Barbara, Barbarbara, etc.
    $ egrep 'Bar(bar)*a' document.txt    # use egrep Extended regexp syntax

> Rhabarbara: <https://www.youtube.com/watch?v=dD2mhVc6C_8>

Parentheses need backslashes in front of them when using a program such as
`grep` that uses **Basic** Regular Expression syntax. The `egrep` program
accepts **Extended** Regular Expression syntax and does not need the
backslashes.

::: allbox
+------------------------+------------------------+------------------------+
| Expression             | Meaning                | Comment                |
+========================+========================+========================+
| `a`                    | match single 'a'       | a simple expression    |
+------------------------+------------------------+------------------------+
| `b`                    | match single 'b'       | another simple         |
|                        |                        | expression             |
+------------------------+------------------------+------------------------+
| `ab`                   | match strings          | "ab" is a larger       |
|                        | consisting of single   | expression formed from |
|                        | 'a' followed by single | two smaller ones       |
|                        | 'b'                    |                        |
|                        |                        | concatenating two      |
|                        |                        | regular expressions    |
|                        |                        | together means         |
|                        |                        | "followed immediately  |
|                        |                        | by" and we'll say      |
|                        |                        | "followed by"          |
+------------------------+------------------------+------------------------+
| `b*`                   | match zero or more 'b' | a big difference in    |
|                        | characters             | meaning from the '*'  |
|                        |                        | in globbing! This is   |
|                        |                        | the regular expression |
|                        |                        | repetition operator.   |
+------------------------+------------------------+------------------------+
| `ab*`                  | 'a' followed by zero   | why not repeating the  |
|                        | or more 'b' characters | two characters 'ab'    |
|                        |                        | zero or more times?    |
|                        |                        | Hint: think of "ab^2^" |
|                        |                        | in algebra.            |
+------------------------+------------------------+------------------------+
| `\(ab\)*`              | ('a' followed by 'b'), | We can use             |
|                        | zero or more times     | parentheses; in Basic  |
|                        |                        | Regular Expressions,   |
|                        |                        | we use `\(` and `\)`   |
+------------------------+------------------------+------------------------+

: Regular Expressions using `*` repetition (zero or more) and parentheses
:::
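
The binding of `*` can be tried out with a few sample lines. This is only a sketch; the four input lines are made up, and the patterns are anchored so that each line has to match from start to end.

```shell
lines=$(printf '%s\n' a ab abb abab)

# * binds only to the b: one a followed by zero or more b
star_on_b=$(printf '%s\n' "$lines" | grep '^ab*$')

# * binds to the whole group: ab repeated zero or more times
star_on_group=$(printf '%s\n' "$lines" | grep '^\(ab\)*$')

# The variables are unquoted on purpose so the matching lines print on one line.
echo "star on b only:" $star_on_b
echo "star on group: " $star_on_group
```

Just as `ab^2^` is not `(ab)^2^` in algebra, `ab*` accepts `abb` but not `abab`, and `\(ab\)*` accepts `abab` but not `abb`.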

Concatenating and repeating Regular Expressions using `*` and `\(...\)`
-----------------------------------------------------------------------

As with algebraic multiplication, there is no operator to concatenate Regular
Expressions to match longer strings. Simply write one expression and follow
it with the next one.

Similar to an algebraic exponent, the asterisk/star `*` Regular Expression
operator binds tightly to the immediately preceding Regular Expression and
repeats it zero or more times. Parentheses can be used for grouping, e.g.

::: allbox
  -------------------------------------------------------------------------------
  Expression       Matches          Example     Example Matches  Comment
  ---------------- ---------------- ----------- ---------------- ----------------
  one expression   first followed   `xy`        "xy"             like globbing
  followed by      by second                                     
  another                                                        

  expression       zero or more     `x*`        "" or "x" or     NOT like the `*`
  followed by `*`  matches of the               "xx" or "xxx"    in globbing,
                   immediately                  ...etc           although `.*`
                   preceding                                     behaves like `*`
                   expression                                    in globbing

  expression in    the expression   `\(ab\)`    "ab"             parentheses are
  parentheses                                                    used for groups

  expression in    the expression   `\(ab\)*`   "" or "ab" or    parentheses are
  parentheses,     repeated zero or             "abab" or        used for groups
  followed by `*`  more times                   "ababab", etc.   
  -------------------------------------------------------------------------------

  : Concatenating and repeating Regular Expressions using `*` and `\(...\)`
:::

Special Characters in Basic Regular Expressions
-----------------------------------------------

Regular Expressions have more special characters than GLOB patterns. Some
special characters need backslashes in front of them to enable them in
**Basic** Regular Expressions.

::: allbox
  ------------------------------------------------------------------------------
  Character        Matches          Example    Example Matches  Comment
  ---------------- ---------------- ---------- ---------------- ----------------
  non-special      itself           `x`        "x"              like globbing
  character                                                     

  `.`              any single       `.`        "x" or "y" or    like the '?' in
                   character                   "!" or "." or    globbing
                                               "*" ...etc      

  `^` *used at     beginning of a   `^x`       "x" if it's the  anchors the
  start of regexp* line of text                first character  match to the
                                               on the line      beginning of a
                                                                line

  `^` *when not    `^` *(itself)*   `a^b`      "a\^b"           \^ has no
  used at start of                                              special meaning
  regexp*                                                       unless it is
                                                                first

  `$` *at end of   end of a line of `x$`       "x" if it's the  anchors the
  regexp*          text                        last character   match to the end
                                               on the line      of a line

  `$` *when not    `$` *(itself)*   `a$b`      "a\$b"           \$ has no
  used at end of                                                special meaning
  regexp*                                                       unless it is
                                                                last

  `\` followed by  that character   `\.`       "."              like globbing
  a special        with its special                             
  character        meaning removed                              

  `\` followed by  the non-special  `\a`       "a"              \\ before a
  a non-special    character (no                                non-special
  character        change)                                      character is
                                                                ignored

  `[` and `]`      character class  `[abc]`    "a" or "b" or    see Class below
                                               "c"
  ------------------------------------------------------------------------------

  : Special Characters in Basic Regular Expressions
:::
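
A few rows of the table can be verified with a sketch like the following. The four sample lines are made up for the purpose.

```shell
lines=$(printf '%s\n' 'a.c' 'abc' 'a^b' 'a$b')

# An unescaped . matches any single character: both a.c and abc qualify.
any_char=$(printf '%s\n' "$lines" | grep -c 'a.c')

# A backslash removes the special meaning: only the line with a real period.
literal_dot=$(printf '%s\n' "$lines" | grep -c 'a\.c')

# ^ and $ are ordinary characters when they are not first or last.
mid_caret=$(printf '%s\n' "$lines" | grep -c 'a^b')
mid_dollar=$(printf '%s\n' "$lines" | grep -c 'a$b')

echo "a.c: $any_char  a\\.c: $literal_dot  a^b: $mid_caret  a\$b: $mid_dollar"
```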

Regular Expressions match anywhere in a line: anchoring with `^` and `$`
------------------------------------------------------------------------

GLOB Patterns are said to be **anchored** to the start and end of the string
being matched. The GLOB pattern `a*b` matches text `axb` but not `abx` or
`xab`. The `a` has to be at the start, and the `b` has to be at the end.

To allow a GLOB pattern to be *unanchored* and match anywhere inside a
string, you need to pad the GLOB with `*` on both sides:

    $ echo a*b                  # anchored: matches axb not abx or xab
    $ echo *a*b*                # now matches abx or xab or xabx or xaxbx

The GLOB pattern has to match the *whole* string, and may need `*` at each
end to allow it to do that.

Unlike GLOB Patterns, which are anchored, Regular Expressions are not
anchored unless you make them so using the explicit anchor characters `^`
and/or `$`. Unanchored Regular Expressions "float" down the string until a
match is found, and they don't have to extend to the end of the string.

Regular Expressions can match just a piece of text in the middle of a line;
they don't have to match the whole line.

The GLOB pattern `a*b` doesn't match the string `xabx` because GLOB is
anchored and has to match the whole string. The Regular Expression `a.*b`
does match inside that string: it is unanchored at both ends (no `^` at the
start, no `$` at the end), so it "floats" down the string and matches the
`ab` in the middle.

Use the line start `^` and line end `$` meta-characters to **anchor** a
Regular Expression to the start or end of a line. Here are some examples of
how GLOB patterns and regexp compare:

    GLOB        Regular Expression (may use anchors)
    ----        ------------------------------------
    foo         ^foo$
    bar[abc]    ^bar[abc]$
    [!abc]      ^[^abc]$                 # note in complement GLOB uses ! vs. ^
    foo?        ^foo.$
    a*b         ^a.*b$
    *foo*       foo                      # unanchored GLOB needs * at ends
    *a*b*       a.*b                     # unanchored GLOB needs * at ends

Remember that an unanchored Regular Expression may match only *part* of a
line, e.g. the regexp `ab` matches only the `ab` part of `xxxabxxx`, not the
whole `xxxabxxx`. GLOB patterns must always match the entire line from start
to end; they can't match a substring inside a line the way regexp can.
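
To see exactly which part of a line a floating regexp matched, one option is the `-o` option of `grep`. This is an assumption about your system: `-o` is provided by GNU and BSD `grep` but is not required by POSIX.

```shell
# Without -o, grep prints the whole line that contains a match.
whole_line=$(printf '%s\n' 'xxxabxxx' | grep 'a.*b')

# With -o (GNU/BSD grep), only the matched part of the line is printed.
matched_part=$(printf '%s\n' 'xxxabxxx' | grep -o 'a.*b')

echo "line selected: $whole_line"
echo "part matched:  $matched_part"
```

The line is selected because it *contains* a match, but the match itself is only the `ab` in the middle.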

Simple Basic Regular Expression Examples
========================================

When testing regular expressions with `grep`:

-   Use the color option (perhaps create an alias): `grep --color=auto`
    -   The part of the string that matched will be colored.
-   Use single quotes to protect your Regular Expression from GLOB expansion
    by the shell.

These `grep` commands select lines that match these Basic Regular
Expressions:

    grep 'ab'        # a followed by b
    grep 'a*b'       # zero or more a followed by b
    grep 'aa*b'      # one or more a followed by b
    grep 'aaa*b'     # two or more a followed by b
    grep 'a.b'       # a then one of anything then b
    grep 'a.*b'      # a then zero or more of anything, then b
    grep 'a..*b'     # a then one or more of anything then b
    grep 'a...*b'    # a then two or more of anything then b
    grep '^a'        # a must be the first character
    grep 'b$'        # b must be the last character
    grep '^a.*b$'    # a must be first, zero or more anything, b must be last

Find any line that contains at least one, two, or three characters of any
kind ("any kind" includes spaces and unprintable characters):

    grep '.'         # contains at least one character (or more)
    grep '..'        # contains at least two characters (or more)
    grep '...'       # contains at least three characters (or more)

    grep '^.$'       # contains exactly one character
    grep '^..$'      # contains exactly two characters
    grep '^...$'     # contains exactly three characters
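
Some of the patterns above can be compared by counting how many of a few sample lines each one selects. The input lines here are made up for illustration.

```shell
lines=$(printf '%s\n' b ab aab xyz)

# a*b needs no a at all, so even the line "b" matches.
zero_or_more=$(printf '%s\n' "$lines" | grep -c 'a*b')

# aa*b needs at least one a; aaa*b needs at least two.
one_or_more=$(printf '%s\n' "$lines" | grep -c 'aa*b')
two_or_more=$(printf '%s\n' "$lines" | grep -c 'aaa*b')

# ^...$ selects lines of exactly three characters: aab and xyz.
exactly_three=$(printf '%s\n' "$lines" | grep -c '^...$')

echo "a*b: $zero_or_more  aa*b: $one_or_more  aaa*b: $two_or_more  ^...\$: $exactly_three"
```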

Regular Expression Character Classes `[...]` -- similar to GLOB
===============================================================

-   Character classes are lists of characters inside square brackets that
    match *one single character* from the list, e.g. `[az3]`
-   Character classes work almost the same in regexp as they do in GLOB, e.g.
    `[az3]` matches *one single character* that is `a` or `z` or `3`
-   Negated/inverted/complemented character classes use a different
    complement character! GLOB uses `[!az3]` to invert but regexp uses
    `[^az3]` to mean: any single character that is *not* `a` or `z` or `3`
-   Character class expressions always match *exactly one* character unless
    they are repeated by appending a regexp repetition operator such as `*`
    (something you can't do with GLOB)

The characters inside the square brackets of a character class form a *set*
of characters where order doesn't matter and repeats don't affect the
meaning. All these below are equivalent and match only one single character
`a` or `z` or `3`:

    grep '[az3]'             # match one single a or z or 3
    grep '[3az]'             # same - order doesn't matter
    grep '[aaazzzz3333]'     # same - bad form - no need to repeat characters

Most Regular Expression special characters lose their meaning when inside
square brackets, but watch out for `^`, `]`, and `-` which do have special
meaning inside square brackets, depending on where they occur.

::: allbox
  ------------------------------------------------------------------------------
  Expression       Matches          Example    Example Matches  Comment
  ---------------- ---------------- ---------- ---------------- ----------------
  character        a SINGLE         `[abc]`    "a" or "b" or    like globbing
  classes `[...]`  character from              "c"              
                   the list                                     

  complement of a  a SINGLE         `[^abc]`   any SINGLE       NOT like GLOB!
  character class  character *not*             character not a  GLOB uses ! as
  `[^...]`         in the list                 or b or c        in \[!abc\]

  special          as if the        `[\]`      `\`              conditions: `]`
  character inside character is not                             must be first,
  `[...]`          special                                      `^` must not be
                                                                first, and `-`
                                                                must be first or
                                                                last
  ------------------------------------------------------------------------------

  : Regular Expressions Character classes `[...]`
:::

Using `^` to complement a character class set: `[^abc]`
-------------------------------------------------------

-   The `^` used immediately inside the opening square bracket of a class
    complements the whole character class set: `[^az3]`. The resulting
    character class expression matches any single character that is *not* in
    the set.

-   The complemented class `[^az3]` means "any single character that is *not*
    `a`, `z`, or `3`"

-   The `^` only works this way if it is the first character inside the
    square brackets, otherwise it has no special meaning.

-   The classes `[a^z3]` or `[az^3]` or `[az3^]` all match one of `a`, `z`,
    `3`, or `^`

-   Remember, a `^` used in a Regular Expression *outside of square brackets*
    has the special meaning "match at beginning of line". Don't confuse it
    with `^` used inside a character class.

-   Note that GLOB patterns complement character sets using `!` and not `^`:

        GLOB        Regular Expression
        [!abc]      [^abc]

Don't confuse GLOB with Regular Expressions.
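
The three uses of `^` described above can be tried on single-character lines. This is a sketch with made-up input; each pattern is anchored so that the class has to match the whole line.

```shell
lines=$(printf '%s\n' a z 3 q '^')

# Plain class: one of a, z, or 3.
in_set=$(printf '%s\n' "$lines" | grep -c '^[az3]$')

# ^ first inside the brackets complements the set: q and ^ qualify.
not_in_set=$(printf '%s\n' "$lines" | grep -c '^[^az3]$')

# ^ anywhere else inside the brackets is just another member of the set.
caret_member=$(printf '%s\n' "$lines" | grep -c '^[az^3]$')

echo "[az3]: $in_set  [^az3]: $not_in_set  [az^3]: $caret_member"
```

Note that each pattern also starts with a `^` *outside* the brackets, which is the line-start anchor and has nothing to do with the class.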

Having closing `]` as part of a character class set
---------------------------------------------------

A `]` character can be placed inside square brackets to be part of the
character class set, but it has to be the first character in the set.
`[]az3]` means one of the four characters `]`, `a`, `z`, or `3` and `[^]az3]`
means any single character that is *not* one of the four characters `]`, `a`,
`z`, or `3`.

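Putting a closing square bracket `]` in any other position inside square
brackets does not do what you want, because the first `]` that is not in
first position ends the character class:

-   `[ab]d]` is parsed as the class `[ab]` followed by the literal text `d]`;
    write `[]abd]` to include `]` in the set
-   `[]` is an unterminated class and a syntax error; it is a failed attempt
    at `[]]`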

You can put an opening `[` anywhere in a character class, e.g.

    $ grep '[([{]' doc.txt     # search for lines with '(' or '[' or '{'
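
A quick sketch of the bracket placement rules, using made-up single-character lines:

```shell
lines=$(printf '%s\n' ']' a '[' x)

# ] placed first is a member of the set: matches ] and a.
with_bracket=$(printf '%s\n' "$lines" | grep -c '^[]a]$')

# ] placed first after the complementing ^: matches anything except ] and a.
without_bracket=$(printf '%s\n' "$lines" | grep -c '^[^]a]$')

# An opening [ can go anywhere in the set: only the [ line matches here.
opening=$(printf '%s\n' "$lines" | grep -c '^[([{]$')

echo "[]a]: $with_bracket  [^]a]: $without_bracket  [([{]: $opening"
```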

POSIX character classes -- e.g. `[:digit:]`
-------------------------------------------

[POSIX Character Class] expressions represent an entire range of characters,
such as "all the digits" or "all the letters". The classes have an awkward
syntax: The POSIX class name is preceded by `[:` and followed by `:]`,
e.g. `[:digit:]`. These are the resulting class names:

::: allbox
  POSIX Class    Description
  -------------- --------------------------------------------------------------------
  `[:alnum:]`    alphanumeric characters
  `[:alpha:]`    alphabetic characters
  `[:cntrl:]`    control characters
  `[:digit:]`    digit characters
  `[:lower:]`    lower case alphabetic characters
  `[:print:]`    visible characters, plus the space character
  `[:punct:]`    Punctuation and other symbol characters
  `[:space:]`    White space (space, tab, CR, LF) characters
  `[:upper:]`    upper case alphabetic characters
  `[:xdigit:]`   Hexadecimal digit characters
  `[:graph:]`    visible characters (anything except spaces and control characters)
:::

-   The exact content of each character class depends on the local language.
-   Only for plain ASCII is it true that "letters" means English `a-z` and
    `A-Z`.
-   Other languages have other "letters", e.g. `é`, `ç`, etc.
-   When we use the POSIX character classes, we are specifying the correct
    set of characters for the local language as per the POSIX description.

These POSIX class names only work inside an enclosing Regular Expression
character class expression using (more) square brackets. What looks like
double square brackets is really an enclosing square bracket character class
expression containing a POSIX class name (which unfortunately also uses
square brackets and colons as part of its name), e.g.

    grep '[0123456789]'     # a digit (a list of all the digits)
    grep '[[:digit:]]'      # a digit - the POSIX class name [:digit:] inside []
    grep '[abcd[:digit:]]'  # a digit or letter a or b or c or d
    grep '[ab[:digit:]cd]'  # same -- a digit or a or b or c or d
    grep '[[:digit:]abcd]'  # same -- a digit or a or b or c or d

Of course you can use multiple POSIX class names inside the character class
expression:

    grep '[[:alpha:][:digit:]]'   # a letter or a digit
    grep '[^[:alpha:][:digit:]]'  # *NOT* a letter or a digit

**WARNING:** You cannot interchange the `[:alpha:]` class and a list of all
the upper- and lower-case letters; they are not always the same because the
POSIX `[:alpha:]` class changes depending on the local language:

    grep '[[:alpha:]]'   # a letter, using the POSIX class name [:alpha:]
    grep '[a-zA-Z]'      # NOT THE SAME AS [:alpha:] - DO NOT USE !
    grep '[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]' # NOT THE SAME
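
The digit and complement examples from this section can be tried on a few plain ASCII lines. The sample lines are made up; with ASCII-only input the counts should not depend on the local language.

```shell
lines=$(printf '%s\n' abc 123 a1 '!?')

# A POSIX class name inside a character class: lines containing a digit.
has_digit=$(printf '%s\n' "$lines" | grep -c '[[:digit:]]')

# The enumerated list of digits selects the same lines.
has_listed_digit=$(printf '%s\n' "$lines" | grep -c '[0123456789]')

# Complement of two POSIX classes: lines containing a character that is
# neither a letter nor a digit.
has_other=$(printf '%s\n' "$lines" | grep -c '[^[:alpha:][:digit:]]')

echo "digit: $has_digit  listed digit: $has_listed_digit  other: $has_other"
```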

POSIX Regular Expression Examples, e.g. `^[[:digit:]]*$`
--------------------------------------------------------

These expressions could be given to `grep`:

-   Any line containing nothing or only alphabetic characters from start to
    end:

        ^[[:alpha:]]*$

-   Any line containing only alphabetic characters from start to end, but
    must have at least one such character (can't be an empty line):

        ^[[:alpha:]][[:alpha:]]*$

-   Any line that begins with a digit (followed by anything or nothing):

        ^[[:digit:]]
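
Here is a sketch that gives the three expressions above to `grep`. The four input lines are made up, and the first one is an empty line.

```shell
lines=$(printf '%s\n' '' abc abc1 7up)

# Nothing or only letters: the empty line and abc.
alpha_or_empty=$(printf '%s\n' "$lines" | grep -c '^[[:alpha:]]*$')

# At least one letter and only letters: just abc.
alpha_only=$(printf '%s\n' "$lines" | grep -c '^[[:alpha:]][[:alpha:]]*$')

# Begins with a digit: just 7up.
starts_with_digit=$(printf '%s\n' "$lines" | grep -c '^[[:digit:]]')

echo "$alpha_or_empty $alpha_only $starts_with_digit"
```

The only difference between the first two counts is the empty line, which `*` (zero or more) allows and the doubled class does not.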

Character class ranges using `[.-.]`
------------------------------------

-   A dash `-` between two characters inside square brackets, e.g. `[0-9]`,
    represents a range of characters between the two, unless the dash is
    first or last in the set of characters, e.g. `[-09]` or `[09-]`
-   Both `[-09]` and `[09-]` mean one of the three characters `0`, `9`, or
    `-`
-   The range expression `[0-9]` means any one character from the set of
    characters located between characters `0` and `9` inclusive.

What determines which characters lie *between* other characters? The result
depends on your current [Locale] and is not well-defined.

Do not use Alphabetic Ranges that depend on Locale, e.g. `[a-z]`
----------------------------------------------------------------

Do not use alphabetic ranges (e.g. `[a-z]`)! The ranges change depending on
your system [Locale] and may change in unexpected ways:

    $ touch A B C Z a b c z

    $ LC_ALL=C

    $ echo *
    A B C Z a b c z

    $ echo [a-z]
    a b c z

    $ LC_ALL=en_CA.UTF-8

    $ echo *
    A a B b C c Z z

    $ echo [a-z]
    a B b C c Z z

-   The range `[a-z]` meaning "any one character between `a` and `z`
    inclusive" used to mean something when there was only one ASCII English
    locale.
-   Now that multiple locales exist, the meaning of "between `a` and `z`
    inclusive" is ambiguous because it means different things in different
    locales.
-   Always use the POSIX character classes for complete letter ranges,
    e.g. `[:alpha:]`.
-   If you need to specify a partial list of characters, enumerate the list;
    do not use a range: use `[abcdefgh]` not `[a-h]`
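
The two recommended alternatives can be sketched as follows, with a range shown only for contrast. The range is predictable here only because the locale is pinned to `C` for that one command; that pinning is an addition for this sketch, not a recommendation to use ranges. The input lines are made up.

```shell
lines=$(printf '%s\n' A b C d)

# Recommended: the POSIX class states what is meant.
posix_lower=$(printf '%s\n' "$lines" | grep -c '^[[:lower:]]$')

# Recommended for a partial list: enumerate the characters.
enumerated=$(printf '%s\n' "$lines" | grep -c '^[abcdefgh]$')

# For contrast only: in the C locale, [a-z] is the 26 lower-case ASCII letters.
c_range=$(printf '%s\n' "$lines" | LC_ALL=C grep -c '^[a-z]$')

echo "[[:lower:]]: $posix_lower  [abcdefgh]: $enumerated  C-locale [a-z]: $c_range"
```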

Extended Regular Expressions: `?` `+` `|` `{` `}` `(` `)`
=========================================================

Some features of Regular Expressions are called **Extended** features. These
features are described below and use more special characters: `?` `+` `|` `{`
`}` `(` `)`

Basic versus Extended Regular Expression syntax: `\|` vs. `|`
-------------------------------------------------------------

The difference between Basic and Extended Regular expressions is whether the
program requires you to use a backslash to make use of the Extended features:

    Basic:     . * ^ $ \   \| \? \+ \{ \} \( \)       # must use backslash
    Extended:  . * ^ $ \    |  ?  +  {  }  (  )       # do *NOT* use backslash

The ordinary `grep` program uses Basic Regular Expressions, so you have to
use backslashes in front of the Extended characters to turn on Extended
features. The `egrep` Extended Regular Expression program (short for
`grep -E`) doesn't need the backslashes:

    $ grep 'Accepted publickey for \(idallen\|cst8207[abc]\?\)' /var/log/auth.log
    $ egrep 'Accepted publickey for (idallen|cst8207[abc]?)' /var/log/auth.log
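
The two syntaxes can be compared without access to a real log file. In this sketch the three log lines are hypothetical stand-ins for `/var/log/auth.log`, `grep -E` is used in place of `egrep`, and the backslashed `\|` and `\?` are assumed to be available in Basic syntax, as they are in GNU `grep`.

```shell
# Hypothetical log lines standing in for /var/log/auth.log.
log=$(printf '%s\n' \
    'Accepted publickey for idallen from 192.0.2.1' \
    'Accepted publickey for cst8207a from 192.0.2.2' \
    'Accepted password for root from 192.0.2.3')

# Basic syntax: the Extended operators need backslashes.
basic=$(printf '%s\n' "$log" | grep -c 'publickey for \(idallen\|cst8207[abc]\?\)')

# Extended syntax: the same operators without backslashes.
extended=$(printf '%s\n' "$log" | grep -E -c 'publickey for (idallen|cst8207[abc]?)')

echo "basic: $basic  extended: $extended"
```

Both commands select the same two lines; only the spelling of the operators differs.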

Basic Regular Expressions are used in these programs and you need to use
backslashes to turn on Extended features:

-   `vi`, `more`, `sed`, `grep`

Extended Regular Expressions are used in these programs and you do *not* need
backslashes to enable Extended features:

-   `less` (e.g. `man` pages)
-   `awk`
-   `egrep` and `grep -E`
-   `perl` and `grep -P`

> The `perl` program (and `grep -P`) has its own set of special
> Perl-compatible Regular Expression features, not described here.

Extended Feature: Repetition: `?` `+` `{n,m}`
---------------------------------------------

Extended Regular Expressions give you more options when repeating a preceding
expression:

::: allbox
  -----------------------------------------------------------------------------
  Basic                     Extended                  Repetition Meaning
  ------------------------- ------------------------- -------------------------
  `*`                       `*`                       zero or more times

  `\?`                      `?`                       zero or one times

  `\+`                      `+`                       one or more times

  `\{n\}`                   `{n}`                     n times, n is an integer

  `\{n,\}`                  `{n,}`                    n or more times, n is an
                                                      integer

  `\{,m\}`                  `{,m}`                    m or fewer times, m is an
                                                      integer (GNU extension)

  `\{n,m\}`                 `{n,m}`                   at least n, at most m
                                                      times, n and m are
                                                      integers
  -----------------------------------------------------------------------------

  : Regular Expressions -- repeat preceding (Repetition)

Examples:

    $ egrep 'colou?r' doc.txt                       # color or colour not colouur
    $ egrep 'has +spaces' doc.txt                   # one or more spaces between
    $ egrep '[0-9]{9}' doc.txt                      # 123456789
    $ egrep '[0-9]{3}-[0-9]{3}-[0-9]{4}' doc.txt    # 123-456-7890
    $ egrep '^.{80}$' doc.txt                       # 80 character lines
    $ grep '^.\{,80\}$' doc.txt                     # 80 character or fewer lines

Note that the `{,m}` capability is not available in all Regular Expression
implementations, since it is a [GNU] extension.
:::
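A short sketch of counted repetition on made-up input (the sample lines and variable names are assumptions). The second command writes the lower bound explicitly as `{0,5}`, which avoids relying on the `{,m}` form:

```shell
# Only the first sample line has the 3-3-4 digit layout.
phone_count=$(printf '%s\n' '613-555-1234' '6135551234' '61-555-1234' |
    grep -E -c '^[0-9]{3}-[0-9]{3}-[0-9]{4}$')

# Lines of five characters or fewer, with an explicit lower bound of zero.
short_count=$(printf '%s\n' 'abc' 'abcdef' | grep -E -c '^.{0,5}$')

echo "phone=$phone_count short=$short_count"
```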

Extended Feature: Alternation (one *or* the other): `ab|cd`
-----------------------------------------------------------

Extended Regular Expressions give you a way of matching one expression **or**
another expression using the logical **or** bar `|` operator:

    $ grep -E 'dog|cat'  doc.txt              # find lines with dog or cat
    $ grep 'dog house\|cat fight' doc.txt     # find lines with "dog house" or "cat fight"

You can do a crude form of alternation using the `-e` option to give the
alternatives (as many as you like) in the `grep` family of programs:

    $ fgrep -e 'dog' -e 'cat' doc.txt         # find lines containing dog or cat
    $ grep -e '^dog$' -e '^cat$' doc.txt      # find lines with *only* dog or cat

The **or** `|` operator binds very loosely. Everything else has higher
precedence:

    $ grep -E '^a|b$' doc.txt                 # lines starting with a or ending with b
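As a rough check, the `|` operator and the `-e` option can be compared on the same made-up input (the sample lines and variable names are assumptions):

```shell
# Hypothetical sample input: two lines mention a pet, one does not.
pets='hot dog
cat nap
bird'

with_bar=$(printf '%s\n' "$pets" | grep -E -c 'dog|cat')     # alternation
with_e=$(printf '%s\n' "$pets" | grep -c -e 'dog' -e 'cat')  # two -e patterns

echo "bar=$with_bar e=$with_e"
```

Both forms select the same two lines here.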

Extended Feature: Grouping with parentheses `a(b|c)d`
-----------------------------------------------------

Parentheses `(` and `)` are an Extended feature that can be used to group
Regular Expressions for repetition, and to override the precedence rules.

    $ egrep 'ab|cd' doc.txt              # ab or cd
    $ egrep 'a(b|c)d' doc.txt            # a followed by "b or c" followed by d

    $ grep -E '^a|b$' doc.txt            # lines starting with a or ending with b
    $ grep -E '^(a|b)$' doc.txt          # lines containing only a or only b

    $ egrep 'Bar(bar)+a' doc.txt         # Barbara, Barbarbara, etc.

(Visit Barbara at the [Rhababer-Barbara-Bar].)
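The effect of the parentheses on precedence can be seen by counting matches on a small made-up input (the sample lines and variable names are assumptions):

```shell
# Hypothetical sample input.
lines='a
b
apple
crab
cat'

# Without grouping: "starts with a" OR "ends with b".
loose=$(printf '%s\n' "$lines" | grep -E -c '^a|b$')

# With grouping: the whole line is exactly "a" or exactly "b".
grouped=$(printf '%s\n' "$lines" | grep -E -c '^(a|b)$')

echo "loose=$loose grouped=$grouped"
```

The ungrouped pattern matches `a`, `b`, `apple`, and `crab`; the grouped pattern matches only `a` and `b`.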

Extended Feature: Tags or Backreferences; `\1` `\2` `\3`
--------------------------------------------------------

Another extended regular expression feature allows you to match later what
matched earlier in a pattern:

-   When you use parentheses for grouping, you can refer to the `n`'th group
    using `\n` (backslash followed by the number `n`).
-   The pattern `(..)\1` means any sequence of two characters that repeats,
    e.g. `abab` or `1-1-` or `XyXy`, etc.
-   The `\1` in the above example refers backward to the first group
    (parenthesized) expression.
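A small sketch of a backreference, written here in Basic syntax with `\(` and `\)` (the sample words and the variable name `pairs` are made up for illustration):

```shell
# Count the lines that contain some two-character sequence repeated
# immediately: \(..\) remembers two characters, \1 must match them again.
pairs=$(printf '%s\n' abab abcd XyXy 1-1- | grep -c '\(..\)\1')
echo "$pairs"
```

Three of the four lines contain a repeated pair; `abcd` does not.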

Regular Expression Precedence
=============================

-   Repetition binds the tightest (think exponentiation).

-   Concatenation is next tightest (think multiplication).

-   Alternation has the loosest or lowest precedence (think addition).

        $ grep 'ab*' doc.txt                 # matches a followed by zero or more b
        $ grep -E 'ab|cd' doc.txt            # matches ab or cd

As in mathematics, Regular Expression precedence can be overridden with
explicit parentheses to do grouping.

::: allbox
+------------------------+------------------------+------------------------+
| Operation              | Regex                  | Algebra                |
+========================+========================+========================+
| grouping               | () or \\(\\)           | parentheses            |
|                        |                        |                        |
|                        |                        | brackets               |
+------------------------+------------------------+------------------------+
| repetition             | \* or ? or + or {n} or | exponentiation         |
|                        | {n,} or {n,m}          |                        |
|                        |                        |                        |
|                        | \* or \\? or \\+ or    |                        |
|                        | \\{n\\} or \\{n,\\} or |                        |
|                        | \\{n,m\\}              |                        |
+------------------------+------------------------+------------------------+
| concatenation          | ab                     | multiplication or      |
|                        |                        | division               |
+------------------------+------------------------+------------------------+
| alternation            | | or \\\|              | addition or            |
|                        |                        | subtraction            |
+------------------------+------------------------+------------------------+

: Precedence rules summary (BEDMAS for Regexp)
:::
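To see that repetition binds tighter than concatenation, compare `ab*` with `(ab)*`. This sketch uses the `-x` option so that the pattern must match the whole line; the sample lines and variable names are assumptions:

```shell
# ab*   means: a, then zero or more b   (the * applies only to b)
# (ab)* means: zero or more copies of ab (the * applies to the group)
single=$(printf '%s\n' abbb abab | grep -E -x 'ab*')
group=$(printf '%s\n' abbb abab | grep -E -x '(ab)*')

echo "ab*   matched: $single"
echo "(ab)* matched: $group"
```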

Backslash to remove regexp meaning of a meta-character: `\.`
============================================================

To remove the Regular Expression meaning of any Regular Expression meta
character, put a backslash in front of it. This applies to both Basic and
Extended Regular Expressions. In all types of Regular Expressions:

-   backslash `\*` matches a literal asterisk
-   backslash `\.` matches a literal period
-   backslash `\\` matches a literal backslash
-   backslash `\$` matches a literal dollar sign
-   backslash `\^` matches a literal circumflex

In Extended Regular Expressions, you need more backslashes to hide the
additional Extended Regular Expression meta-characters, e.g. `\+` hides the
meaning of `+` and matches a real plus sign in an Extended Regular
Expression, just as `\?` matches a real question mark:

    $ egrep 'foo\++' doc.txt        # match one or more plus signs (Extended)
    $ grep 'foo+\+' doc.txt         # match one or more plus signs (Basic)

    $ egrep 'foo\??' doc.txt        # match an optional question mark (Extended)
    $ grep 'foo?\?' doc.txt         # match an optional question mark (Basic)
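A minimal sketch of the difference a backslash makes, using the period as the meta-character (the sample lines and variable names are made up):

```shell
# Unescaped: the period matches any one character, so both lines match.
any_char=$(printf '%s\n' 'a.b' 'axb' | grep -c 'a.b')

# Escaped: the period matches only a literal period, so one line matches.
literal_dot=$(printf '%s\n' 'a.b' 'axb' | grep -c 'a\.b')

echo "any=$any_char literal=$literal_dot"
```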

Regular Expression Traps and Pitfalls
=====================================

POSIX character class names are indivisible
-------------------------------------------

The POSIX class name includes the surrounding colons and square brackets and
nothing should ever be placed inside those brackets. This is a common
mistake:

    grep '[[^:digit:]]'    # WRONG ! no longer a POSIX class name !
    grep '[^[:digit:]]'    # correct - match any single non-digit character

Using what you think is a POSIX character class outside of the enclosing
character class square brackets does not work. On some systems, `grep` will
warn you that it doesn't work:

    $ grep '[:alnum:]'       # WRONG !
    grep: character class syntax is [[:space:]], not [:space:]

On other systems, the character class expression will quietly match the list
of characters inside the outer square brackets, i.e. match one of the
characters `:`, `a`, `l`, `n`, `u`, or `m`!
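For comparison, here is a sketch of the correct form in use. The sample lines and the variable name `non_digit` are assumptions for illustration:

```shell
# The ^ goes inside the outer brackets but outside the class name [:digit:].
# Only the line containing a non-digit character is selected.
non_digit=$(printf '%s\n' '12345' '12a45' | grep '[^[:digit:]]')
echo "$non_digit"
```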

Regexp matches are as long as possible
--------------------------------------

A Regular Expression match is always as long as possible. Such matches are
called "greedy":

-   `a.*c` matches all of `abc___abc` -- it doesn't only match the first
    `abc`.
-   You can turn off this *greedy* behaviour in some implementations of
    Regular Expressions. (See the `perl` expression `*?`, also available as
    `grep -P`.)
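One way to observe the greedy match is to replace it with a marker using `sed`. This is a sketch; the marker `X` and the variable name `greedy` are arbitrary choices:

```shell
# If the match stopped at the first "abc", the output would be X___abc.
# Because the match is as long as possible, the whole string is replaced.
greedy=$(echo 'abc___abc' | sed -e 's/a.*c/X/')
echo "$greedy"
```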

Don't use repeat operators at line boundaries in `grep`
-------------------------------------------------------

All the expressions below match the same set of lines containing a letter
`a`, but the first expression uses a lot less processing power than the
others:

    $ grep 'a'      file.txt    # this is the cleanest and fastest one
    $ grep 'aa*'    file.txt
    $ grep 'a.*'    file.txt
    $ grep '.*a'    file.txt
    $ grep '.*a.*'  file.txt

If you're looking for lines containing a piece of text, don't complicate the
regexp with repeat operators that waste computer time but don't change which
lines the regexp finds.
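A quick check that the five patterns above select the same lines, using a made-up three-line input (the sample words and variable names are assumptions):

```shell
# Count matching lines for each pattern; every count should be the same.
counts=''
for pattern in 'a' 'aa*' 'a.*' '.*a' '.*a.*'; do
    n=$(printf '%s\n' banana cherry a | grep -c "$pattern")
    counts="$counts $n"
done
echo "$counts"
```

Each pattern finds the two lines that contain an `a`, so there is no reason to use anything but the simplest one.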

Unix/Linux regex processing is line based
-----------------------------------------

-   Linux text files are usually processed line by line when matching Regular
    Expressions; regular expressions will not cross line boundaries.
-   The newlines at the end of every line are not usually considered part of
    the text that can be matched.
-   The `vim` editor is an exception -- it has a special syntax for matching
    across line ends, e.g. `abc\ndef` and `abc\_.def`, but this doesn't work
    anywhere else (so don't worry about it here).
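A short sketch of the line-based behaviour in `grep` (the sample text and variable names are made up). The period in `abc.def` can match a space on the same line, but it cannot match the newline between two lines:

```shell
# "abc" and "def" on separate lines: no single line matches.
# The "|| true" keeps the zero-match exit status of grep from being an error.
across_lines=$(printf 'abc\ndef\n' | grep -c 'abc.def' || true)

# "abc" and "def" on the same line: one match.
same_line=$(printf 'abc def\n' | grep -c 'abc.def')

echo "across=$across_lines same=$same_line"
```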

Regular Expressions match anywhere in a line: anchoring with `^` and `$`
------------------------------------------------------------------------

Unlike GLOB Patterns, which are anchored, Regular Expressions are not
anchored unless you make them so using the explicit anchor characters `^`
and/or `$`. Unanchored Regular Expressions "float" down the string until a
match is found, and they don't have to extend to the end of the string.

    $ echo a*b                  # anchored: matches axb not abx or xab
    $ ls | grep '^a.*b$'        # equivalent anchored Regular Expression
    $ ls | grep 'a.*b'          # NOT equivalent unanchored Regular Expression

Regular Expressions "float" down the string unless they are anchored.
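The difference can be counted on a made-up list of names standing in for the output of `ls` (the names and variable names are assumptions):

```shell
# Hypothetical file names, one per line.
names='axb
abx
xab'

anchored=$(printf '%s\n' "$names" | grep -c '^a.*b$')   # like GLOB a*b
floating=$(printf '%s\n' "$names" | grep -c 'a.*b')     # matches anywhere

echo "anchored=$anchored floating=$floating"
```

Only `axb` matches the anchored pattern; all three names contain an `a` followed somewhere by a `b`, so the unanchored pattern matches them all.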

Expressions matching zero length strings match everywhere
---------------------------------------------------------

-   The repetition operator `*` means "zero or more".

-   A Regular Expression consisting of zero of anything can match anywhere
    and everywhere between all the characters in a line.

-   For example, if you have a line with any 10 characters in it, the
    zero-length Regular Expression `x*` (meaning zero or more `x` characters)
    could match 11 times, before and after every one of the 10 characters (if
    it doesn't match any of the characters themselves):

        $ echo '0123456789' | sed -e 's/x*/-/g'
        -0-1-2-3-4-5-6-7-8-9-

-   The `grep` colour option and web tools such as <http://regexpal.com>
    cannot highlight matches of zero characters, but the matches are there!

-   The `vim` editor will highlight the entire line when a zero-length
    expression matches between all the characters.
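The same effect shows up in `grep`: a pattern that can match zero characters selects every line, even lines with no `x` in them. A sketch with made-up input:

```shell
# None of these lines contains an x, yet x* (zero or more x) matches them all.
zero_len=$(printf '%s\n' one two three | grep -c 'x*')
echo "$zero_len"
```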

Quote all regexp to hide them from the shell
--------------------------------------------

This Regular Expression below sometimes works, and sometimes does not,
depending on what file names match the `aa*` GLOB pattern in the current
directory:

    grep aa* foo.txt                      # no quotes, GLOB expands: bad idea

-   The shell will try to do filename globbing on `aa*`, possibly changing it
    into existing filenames that begin with `a` before `grep` runs: we don't
    want that.
-   Use shell quoting around Regular Expressions; don't let the shell GLOB
    expansion change the regex before `grep` sees the regex.
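The problem can be reproduced in a scratch directory. This is only a sketch: the directory comes from `mktemp -d`, and the file name `aardvark` is a made-up name chosen because it matches the GLOB `aa*`:

```shell
demo_dir=$(mktemp -d)
printf 'aaa\n' > "$demo_dir/foo.txt"
: > "$demo_dir/aardvark"      # an empty file whose name matches the GLOB aa*

# Unquoted: the shell expands aa* to aardvark, so grep searches for "aardvark".
unquoted=$(cd "$demo_dir" && grep -c aa* foo.txt || true)

# Quoted: grep receives the regexp aa* unchanged.
quoted=$(cd "$demo_dir" && grep -c 'aa*' foo.txt)

echo "unquoted=$unquoted quoted=$quoted"
rm -r "$demo_dir"
```

The unquoted command finds nothing because it is no longer searching for the pattern that was typed.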

Alphabetic ranges are not well-defined in all Locales
-----------------------------------------------------

-   Do not use alphabetic dashed range expressions, e.g. `[a-m]`; in many
    common locales they do not match what you think they match (only
    lower-case letters).
-   Use the POSIX character classes to match all characters,
    e.g. `[[:lower:]]`
-   Specify all the characters in a partial range, e.g. use `[abcdefghijklm]`
    not `[a-m]`
-   Numeric ranges such as `[0-9]` are accepted. (Are there any locales where
    `[0-9]` does not mean the ten digits zero through nine?)
-   You can also explicitly set your Locale to be the old ASCII Locale, which
    makes alphabetic ranges safe to use: `LC_ALL=C`
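A minimal sketch of the last point, setting `LC_ALL=C` for a single command (the one-letter inputs and variable names are made up):

```shell
# In the C locale, [a-z] means only the 26 lower-case ASCII letters.
# The "|| true" keeps the zero-match exit status of grep from being an error.
upper_hits=$(printf '%s\n' B | LC_ALL=C grep -c '[a-z]' || true)
lower_hits=$(printf '%s\n' b | LC_ALL=C grep -c '[a-z]')

echo "upper=$upper_hits lower=$lower_hits"
```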

Regular Expressions in programs: `vi` `sed` `less`
==================================================

`vi` reference: <http://www.tutorialspoint.com/unix/unix-vi-editor.htm>

You can search and replace in `vi` using a Basic Regular Expression in a
Substitution line command. The substitution command by default uses slashes
to delimit the text to match and the replacement text:

    :%s/colou\?r/COLOUR/g      # make all color and colour upper-case

The program `sed` (Stream EDitor) can apply a Basic Regular Expression
substitution non-interactively by reading a file (or standard input) and
writing to standard output:

    $ sed -e 's/colou\?r/COLOUR/g' input.txt >output.txt
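The backslashed `\?` is not understood by every version of `sed`. As a sketch of a more portable spelling, the interval `\{0,1\}` also means "zero or one times" in a Basic Regular Expression (the sample text and the variable name `replaced` are assumptions):

```shell
# Replace both spellings on one line; \{0,1\} makes the "u" optional.
replaced=$(printf '%s\n' 'color and colour' |
    sed -e 's/colou\{0,1\}r/COLOUR/g')
echo "$replaced"
```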

You can search using Regular Expressions in the interactive programs `vi`,
`more`, and `less` (and also `man`, that uses `less`) by typing a slash
followed by the Regular Expression to search for:

    /^ *read                    # find "read" at line start, after optional spaces

(Remember that `vi` and `more` use Basic Regular Expressions and `less` uses
Extended Regular Expressions.)

Example: capitalize sentences repeatedly in a document using `vi`
-----------------------------------------------------------------

Task: Any lower-case letter following a period and two spaces should be made
upper-case. Easy to do using Regular Expressions in `vi`:

-   To search forward in `vi`, type: `/\.  [[:lower:]]`
-   Then type `4~` to toggle the case of the four matched characters; only
    the letter changes, because the period and spaces have no case.
-   Then type `n` (next match) and `.` (repeat change) as many times as
    necessary.
-   The `n` command moves to the next occurrence, and `.` repeats the
    capitalization command.

Example: uncapitalize in middle of words
----------------------------------------

Any upper-case character following a lower case character should be made
lower case, e.g. `uNcapitalize` or `aWkward` or `iN`

-   to search forward in `vi`, type: `/[[:lower:]][[:upper:]]`
-   then type `l` to move one to the right (off of the lower-case letter)
-   type `~` to change the capitalization
-   type `nl.` as necessary
-   the `l` is needed because vi will position the cursor on the first
    character of the match, which in this case is a character that doesn't
    change.

> **Advanced:** In `vim` you can also use the syntax
> `/[[:lower:]][[:upper:]]/b1` to both match the text and move the
> cursor right one position. Then you can just repeat the two characters `n.`
> as many times as necessary. The `vim` editor has very advanced pattern
> search and cursor position capabilities; type `:help regexp`

Regular Expression Resources
============================

-   <http://www.regular-expressions.info/tutorial.html>
-   <http://regexone.com>
-   <http://lynda.com>
-   <http://regexpal.com>
-   <http://www.regular-expressions.info/posixbrackets.html>

<http://lynda.com>
------------------

-   Some students are already comfortable with the command line
-   For those who aren't, yet another tutorial source that might help is
    Lynda.com
-   All Algonquin students have free access to Lynda.com
-   Unix for Mac OSX users:

Lynda.com has a course on regular expressions.

The problem is that it covers our material as well as some more advanced
topics that we won't cover.

It is a good presentation, and the following chapters should have minimal
references to the "too advanced" material:

-   Chapter 2 Characters
-   Chapter 3 Character Sets
-   Chapter 4 Repetition Expressions

Interactive Regular Expression Tutorial
---------------------------------------

For a quick interactive tutorial on Regular Expressions, see
<http://regexone.com/> but be aware that this tutorial uses some short-hand
expressions that we don't use in this course because they don't work
everywhere:

::: allbox
  Shortcut   POSIX Character Class
  ---------- ----------------------------
  `\w`       similar to `[[:alnum:]_]`
  `\W`       similar to `[^[:alnum:]_]`
  `\s`       similar to `[[:space:]]`
  `\S`       similar to `[^[:space:]]`
:::

The tutorial does not use or understand the POSIX character classes that are
more standard in Unix/Linux programs.
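As a sketch of using the POSIX form from the table in place of a shortcut, this counts the lines made up entirely of "word" characters (the sample lines and the variable name `word_lines` are assumptions):

```shell
# [[:alnum:]_] stands in for the \w shortcut: letters, digits, or underscore.
# The line containing a space is not counted.
word_lines=$(printf '%s\n' 'foo_bar9' 'foo bar' | grep -c '^[[:alnum:]_]*$')
echo "$word_lines"
```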

    -- 
    | Ian! D. Allen  -  idallen@idallen.ca  -  Ottawa, Ontario, Canada
    | Home Page: http://idallen.com/   Contact Improv: http://contactimprov.ca/
    | College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/
    | Defend digital freedom:  http://eff.org/  and have fun:  http://fools.ca/

[Plain Text] - plain text version of this page in [Pandoc Markdown] format

  [www.idallen.com]: http://www.idallen.com/
  [Course Home Page]: ..
  [Course Outline]: course_outline.pdf
  [All Weeks]: indexcgi.cgi
  [Plain Text]: 800_regular_expressions.txt
  [Regular Expressions]: http://en.wikipedia.org/wiki/Regular_expression
  [POSIX Character Class]: http://en.wikipedia.org/wiki/Regular_expression#Character_classes
  [Locale]: /cst8177/15w/notes/000_character_sets.html
  [GNU]: http://gnu.org/
  [Rhababer-Barbara-Bar]: http://www.youtube.com/watch?v=2bim74DR9rI
  [Pandoc Markdown]: http://johnmacfarlane.net/pandoc/