% Regular Expressions -- matching patterns and replacing text
% Ian! D. Allen -- [www.idallen.com]
% Winter 2016 - January to April 2016 - Updated 2016-10-28 17:25 EDT

Matching Patterns: GLOB vs. Regular Expressions
===============================================

There are two different pattern matching facilities that we use in
Unix/Linux: **GLOB patterns** and **[Regular Expressions]**. Regular
Expressions are another way to match patterns in text, similar to but more
powerful than simple GLOB patterns. Pay close attention to which of the two
situations you're in, because some of the same special characters common to
GLOB and Regular Expressions have different meanings!

GLOB patterns (review)
----------------------

There are several major places where GLOB patterns are used:

### File GLOB in the Shell: `*.txt`

In the shell, GLOB patterns may be used to match existing pathnames in the
file system:

    $ ls *.txt
    $ echo ?????.txt
    $ touch [ab]*.txt

The shell tries to expand the GLOB to match existing pathnames before the
associated command runs.

### `case` statement GLOB in the Shell

GLOB patterns are used in shell `case` statements to match the text at the
top of the `case` statement:

    case "$1" in
    /* ) type='Absolute Pathname' ;;
    *  ) type='Relative Pathname' ;;
    esac

### GLOB in the `find` command: `-name '*.txt'`

The `find` command `-name` operator also matches GLOB patterns against the
file system, but it does so recursively in every directory, not just in one
directory:

    $ find . -name '*.txt'
    $ find . -name '?????.txt'
    $ find . -name '[ab]*.txt'

We quote the patterns above to hide them from the shell so that the `find`
command receives the patterns and the shell doesn't try to expand them.

Regular Expressions -- Basic and Extended
-----------------------------------------

**[Regular Expressions]** (short form: *regexp*) are text matching patterns
similar to GLOB patterns but more powerful.
Regexp patterns use all the GLOB pattern matching characters and add more.
The characters work slightly differently between GLOB and regexp.

Regexp are used by many Unix/Linux programs and programming languages such
as `grep`, `sed`, `awk`, `vim`, `less`, `more`, `man`, `Perl`, `python`, etc.

In an editor (such as `vim` or `sed`), a Regular Expression may be used to
select characters to be deleted, replaced, or exchanged:

    :%s/colou*r/COLOUR/g         # vim replacement regular expression

    $ echo "Colouur bad. Colour red. Color tan." | sed -e 's/Colou*r/COLOUR/g'
    COLOUR bad. COLOUR red. COLOUR tan.

Regexp have a **Basic** set of pattern matching characters and an
**Extended** set of characters. The `grep` program family is a very popular
user of both **Basic** and **Extended** Regular Expressions.

The `grep` command itself accepts **Basic** Regular Expression syntax, and
needs backslashes in front of some operators to access **Extended** Regular
Expression features. The `egrep` command accepts **Extended** Regular
Expression syntax and does not need the backslashes. You can do the same
text search using either command, but the syntax changes:

    $ grep 'publickey for \(idallen\|cst8207[abc]\?\)' /var/log/auth.log  # Basic
    $ egrep 'publickey for (idallen|cst8207[abc]?)' /var/log/auth.log     # Extended

From the section `REGULAR EXPRESSIONS` in the man page for the `grep`
command:

    Basic vs Extended Regular Expressions
       In basic regular expressions the meta-characters ?, +, {, |, (,
       and ) lose their special meaning; instead use the backslashed
       versions \?, \+, \{, \|, \(, and \).

Even the `bash` shell has extended syntax that allows the use of regular
expressions instead of simple GLOB patterns.

> **IMPORTANT:** Regular Expressions use some of the same special characters
> as GLOB patterns, but they mean different things! In particular, `*`, `?`,
> and `.` work differently! There are others!
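
The Basic and Extended forms of the `grep` search above can be compared on a
few sample lines. This is only a sketch: the sample text is hypothetical (it
is not real `auth.log` content), and the backslashed `\|` and `\?` operators
in the Basic pattern assume GNU `grep`.

```shell
# Hypothetical sample lines standing in for /var/log/auth.log.
sample='Accepted publickey for idallen
Accepted publickey for cst8207a
Accepted publickey for root'

# Basic syntax: the Extended operators need backslashes (GNU grep assumed).
basic=$(printf '%s\n' "$sample" | grep 'publickey for \(idallen\|cst8207[abc]\?\)')

# Extended syntax: the same operators, written without backslashes.
extended=$(printf '%s\n' "$sample" | grep -E 'publickey for (idallen|cst8207[abc]?)')

# Count the selected lines; the "root" line should not be among them.
count=$(printf '%s\n' "$sample" | grep -c -E 'publickey for (idallen|cst8207[abc]?)')

if [ "$basic" = "$extended" ]; then
    printf '%s\n' "both syntaxes selected the same $count lines"
fi
```

Only the spelling of the operators differs between the two commands; the set
of lines selected is the same.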

GLOB patterns are *anchored*; Regular Expressions *float*
---------------------------------------------------------

GLOB patterns are said to be **anchored** to the start and end of the line;
they must always match the entire text string (usually a file name) from the
start to the end.

The GLOB pattern `a*b` matches only text that starts with `a` and ends with
`b` -- that GLOB pattern doesn't match just the `ab` in the middle of
`xxxabxxx`.

The modified GLOB pattern `*a*b*` matches any whole text that *contains* `a`
followed by `b` anywhere in the text. The modified GLOB pattern *does* match
the entire text `xxxabxxx`.

Regular Expressions are *not* anchored by default. They "float" down the
text and they may match *anywhere* in the text string unless you explicitly
anchor them to either the start or end of the text using the regexp
characters `^` and/or `$`.

The Regular Expression `a.*b` matches inside any text that *contains* `a`
followed by `b` anywhere in the text. The floating regexp *does* match the
`ab` in the middle of `xxxabxxx`.

The modified Regular Expression `^a.*b$` is **anchored** to the start and
end of the text. The modified expression matches exactly the same text as
the GLOB pattern `a*b` because it forces the `a` to match at the start and
the `b` to match at the end. It does *not* match inside `xxxabxxx`.

You must remember to anchor the ends of your Regular Expressions if you want
to be sure that they match the *whole* piece of text and not just some part
of the text.

Summary:

- Unanchored regexp `a*b` matches (only) the text `ab` inside `xxxabxxx`.
- Anchored regexp `^a*b$` does not match the `ab` inside `xxxabxxx` because
  the `a` has to be at the start and the `b` has to be at the end. It does
  match the string `aaaaab`.
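
The difference between anchored GLOB patterns and floating Regular
Expressions can be sketched in the shell. This example uses a `case`
statement for the GLOB tests and `grep -c` to count matching lines; the
variable names are only illustrative.

```shell
text='xxxabxxx'

# GLOB: the pattern must cover the whole string.
case "$text" in
    a*b ) glob_plain='match' ;;
    *   ) glob_plain='no match' ;;
esac

# GLOB padded with * at both ends: now it can match "inside" the string.
case "$text" in
    *a*b* ) glob_padded='match' ;;
    *     ) glob_padded='no match' ;;
esac

# Regexp: unanchored floats down the line; anchored must cover the whole line.
# (grep -c exits non-zero when it counts zero lines, hence the "|| true".)
floating=$(printf '%s\n' "$text" | grep -c 'a.*b' || true)
anchored=$(printf '%s\n' "$text" | grep -c '^a.*b$' || true)

printf '%s\n' "GLOB a*b: $glob_plain; GLOB *a*b*: $glob_padded"
printf '%s\n' "regexp a.*b: $floating line; regexp ^a.*b\$: $anchored lines"
```

The unpadded GLOB and the anchored regexp both reject `xxxabxxx`, while the
padded GLOB and the floating regexp both accept it.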

Regular Expressions compared with Algebraic Expressions
=======================================================

Like algebraic expressions, more complex Regular Expressions are built up by
combining simpler expressions. Regular Expressions have operators similar to
algebraic operators, but they mean different things than in algebra. Like
algebraic operators, Regular Expression operators have bindings and
precedence when combined with other operators.

Before we look at Regular Expressions, let's take a look at some Algebraic
Expressions you're already comfortable with. Larger Algebraic Expressions
are formed by putting smaller expressions together:

::: allbox
+------------------------+------------------------+------------------------+
| Expression             | Meaning                | Comment                |
+========================+========================+========================+
| a                      | a                      | a simple expression    |
+------------------------+------------------------+------------------------+
| b                      | b                      | another simple         |
|                        |                        | expression             |
+------------------------+------------------------+------------------------+
| ab                     | a x b                  | ab is a larger         |
|                        |                        | expression formed from |
|                        |                        | two smaller ones       |
|                        |                        |                        |
|                        |                        | concatenating two      |
|                        |                        | expressions together   |
|                        |                        | means to multiply them |
+------------------------+------------------------+------------------------+
| b^2^                   | b x b                  | we might have          |
|                        |                        | represented this with  |
|                        |                        | b\^2, using \^ as an   |
|                        |                        | exponentiation         |
|                        |                        | operator               |
+------------------------+------------------------+------------------------+
| ab^2^                  | a x (b x b)            | not (a x b) x (a x b)  |
+------------------------+------------------------+------------------------+
| (ab)^2^                | (a x b) x (a x b)      | parentheses for        |
|                        |                        | grouping               |
+------------------------+------------------------+------------------------+

: Algebraic Expressions
:::

Basic Regular Expressions using `*` repetition (zero or more) and parentheses
-----------------------------------------------------------------------------

Similar to an algebraic exponent, the asterisk/star `*` Regular Expression
operator binds tightly to the immediately preceding Regular Expression and
repeats it zero or more times. Parentheses (a feature of Extended Regular
Expressions) can be used for grouping, e.g.

    $ grep 'suc*eed' document.txt       # find sueed, suceed, succeed, succceed, etc.
    $ grep 'Bar\(bar\)*a' document.txt  # find Bara, Barbara, Barbarbara, etc.
    $ egrep 'Bar(bar)*a' document.txt   # use egrep Extended regexp syntax

Parentheses need backslashes in front of them when using a program such as
`grep` that uses **Basic** Regular Expression syntax. The `egrep` program
accepts **Extended** Regular Expression syntax and does not need the
backslashes.

::: allbox
+------------------------+------------------------+------------------------+
| Expression             | Meaning                | Comment                |
+========================+========================+========================+
| `a`                    | match single 'a'       | a simple expression    |
+------------------------+------------------------+------------------------+
| `b`                    | match single 'b'       | another simple         |
|                        |                        | expression             |
+------------------------+------------------------+------------------------+
| `ab`                   | match strings          | "ab" is a larger       |
|                        | consisting of single   | expression formed from |
|                        | 'a' followed by single | two smaller ones       |
|                        | 'b'                    |                        |
|                        |                        | concatenating two      |
|                        |                        | regular expressions    |
|                        |                        | together means         |
|                        |                        | "followed immediately  |
|                        |                        | by" and we'll say      |
|                        |                        | "followed by"          |
+------------------------+------------------------+------------------------+
| `b*`                   | match zero or more 'b' | a big difference in    |
|                        | characters             | meaning from the '*'   |
|                        |                        | in globbing! This is   |
|                        |                        | the regular expression |
|                        |                        | repetition operator.   |
+------------------------+------------------------+------------------------+
| `ab*`                  | 'a' followed by zero   | why not repeating the  |
|                        | or more 'b' characters | two characters 'ab'    |
|                        |                        | zero or more times?    |
|                        |                        | Hint: think of "ab^2^" |
|                        |                        | in algebra.            |
+------------------------+------------------------+------------------------+
| `\(ab\)*`              | ('a' followed by 'b'), | We can use             |
|                        | zero or more times     | parentheses; in Basic  |
|                        |                        | Regular Expressions,   |
|                        |                        | we use `\(` and `\)`   |
+------------------------+------------------------+------------------------+

: Regular Expressions using `*` repetition (zero or more) and parentheses
:::

Concatenating and repeating Regular Expressions using `*` and `\(...\)`
-----------------------------------------------------------------------

As with algebraic multiplication, there is no operator to concatenate
Regular Expressions to match longer strings. Simply write one expression and
follow it with the next one.

Similar to an algebraic exponent, the asterisk/star `*` Regular Expression
operator binds tightly to the immediately preceding Regular Expression and
repeats it zero or more times. Parentheses can be used for grouping, e.g.

::: allbox
-------------------------------------------------------------------------------
Expression       Matches          Example     Example Matches  Comment
---------------- ---------------- ----------- ---------------- ----------------
one expression   first followed   `xy`        "xy"             like globbing
followed by      by second
another

expression       zero or more     `x*`        "" or "x" or     NOT like the `*`
followed by `*`  matches of the               "xx" or "xxx"    in globbing,
                 immediately                  ...etc           although `.*`
                 preceding                                     behaves like `*`
                 expression                                    in globbing

expression in    the expression   `\(ab\)`    "ab"             parentheses are
parentheses                                                    used for groups

expression in    the expression   `\(ab\)*`   "" or "ab" or    parentheses are
parentheses,     repeated zero or             "abab" or        used for groups
followed by `*`  more times                   "ababab", etc.
-------------------------------------------------------------------------------

: Concatenating and repeating Regular Expressions using `*` and `\(...\)`
:::

Special Characters in Basic Regular Expressions
-----------------------------------------------

Regular Expressions have more special characters than GLOB patterns. Some
special characters need backslashes in front of them to enable them in
**Basic** Regular Expressions.

::: allbox
------------------------------------------------------------------------------
Character        Matches          Example    Example Matches  Comment
---------------- ---------------- ---------- ---------------- ----------------
non-special      itself           `x`        "x"              like globbing
character

`.`              any single       `.`        "x" or "y" or    like the '?' in
                 character                   "!" or "." or    globbing
                                             "*" ...etc

`^` *used at     beginning of a   `^x`       "x" if it's the  anchors the
start of         line of text                first character  match to the
regexp*                                      on the line      beginning of a
                                                              line

`^` *when not    `^` *(itself)*   `a^b`      "a\^b"           \^ has no
used at start of                                              special meaning
regexp*                                                       unless first

`$` *at end of   end of a line of `x$`       "x" if it's the  anchors the
regexp*          text                        last character   match to the end
                                             on the line      of a line

`$` *when not    `$` *(itself)*   `a$b`      "a$b"            $ has no
used at end of                                                special meaning
regexp*                                                       unless last

`\` followed by  that character   `\.`       "."              like globbing
a special        with its special
character        meaning removed

`\` followed by  the non-special  `\a`       "a"              \\ before a
a non-special    character (no                                non-special
character        change)                                      character is
                                                              ignored

`[` and `]`      character class  `[abc]`    "a" or "b" or    see Class below
                                             "c"
------------------------------------------------------------------------------

: Special Characters in Basic Regular Expressions
:::

Regular Expressions match anywhere in a line: anchoring with `^` and `$`
------------------------------------------------------------------------

GLOB Patterns are said to be **anchored** to the start and end of the string
being matched.
The GLOB pattern `a*b` matches text `axb` but not `abx` or `xab`. The `a`
has to be at the start, and the `b` has to be at the end. To allow a GLOB
pattern to be *unanchored* and match anywhere inside a string, you need to
pad the GLOB with `*` on both sides:

    $ echo a*b          # anchored: matches axb not abx or xab
    $ echo *a*b*        # now matches abx or xab or xabx or xaxbx

The GLOB pattern has to match the *whole* string, and may need `*` at each
end to allow it to do that.

Unlike GLOB Patterns, which are anchored, Regular Expressions are not
anchored unless you make them so using the explicit anchor characters `^`
and/or `$`. Unanchored Regular Expressions "float" down the string until a
match is found, and they don't have to extend to the end of the string.

Regular Expressions can match just a piece of text in the middle of a line;
they don't have to match the whole line. The GLOB pattern `a*b` doesn't
match the string `xabx` because GLOB is anchored and has to match the whole
string, but the Regular Expression `a.*b` does match inside the line,
because it is unanchored at both ends: it floats down the string and matches
the `ab` in the middle of the string. The regexp starts unanchored (no `^`
at the start) and thus "floats" down the string to do the match.

Use the line start `^` and line end `$` meta-characters to **anchor** a
Regular Expression to the start or end of a line. Here are some examples of
how GLOB patterns and regexp compare:

    GLOB        Regular Expression (may use anchors)
    ----        ------------------------------------
    foo         ^foo$
    bar[abc]    ^bar[abc]$
    [!abc]      ^[^abc]$    # note in complement GLOB uses ! vs. ^
    foo?        ^foo.$
    a*b         ^a.*b$
    *foo*       foo         # unanchored GLOB needs * at ends
    *a*b*       a.*b        # unanchored GLOB needs * at ends

Remember that an unanchored Regular Expression may match only *part* of a
line, e.g. the regexp `ab` matches only the `ab` part of `xxxabxxx`, not the
whole `xxxabxxx`.
GLOB patterns must always match the entire line from start to end; they
can't match a substring inside a line the way regexp can.

Simple Basic Regular Expression Examples
========================================

When testing regular expressions with `grep`:

- Use the color option (perhaps create an alias): `grep --color=auto`
    - The part of the string that matched will be colored.
- Use single quotes to protect your Regular Expression from GLOB expansion
  by the shell.

These `grep` commands select lines that match these Basic Regular
Expressions:

    grep 'ab'       # a followed by b
    grep 'a*b'      # zero or more a followed by b
    grep 'aa*b'     # one or more a followed by b
    grep 'aaa*b'    # two or more a followed by b
    grep 'a.b'      # a then one of anything then b
    grep 'a.*b'     # a then zero or more of anything, then b
    grep 'a..*b'    # a then one or more of anything then b
    grep 'a...*b'   # a then two or more of anything then b
    grep '^a'       # a must be the first character
    grep 'b$'       # b must be the last character
    grep '^a.*b$'   # a must be first, zero or more anything, b must be last

Find any line that contains at least (or exactly) one, two, or three
characters of any kind ("any kind" includes spaces and unprintable
characters):

    grep '.'        # contains at least one character (or more)
    grep '..'       # contains at least two characters (or more)
    grep '...'      # contains at least three characters (or more)
    grep '^.$'      # contains exactly one character
    grep '^..$'     # contains exactly two characters
    grep '^...$'    # contains exactly three characters

Regular Expression Character Classes `[...]` -- similar to GLOB
===============================================================

- Character classes are lists of characters inside square brackets that
  match *one single character* from the list, e.g. `[az3]`
- Character classes work almost the same in regexp as they do in GLOB, e.g.
  `[az3]` matches *one single character* that is `a` or `z` or `3`
- Negated/inverted/complemented character classes use a different
  complement character!
  GLOB uses `[!az3]` to invert but regexp uses `[^az3]` to mean: any single
  character that is *not* `a` or `z` or `3`
- Character class expressions always match *exactly one* character unless
  they are repeated by appending a regexp repetition operator such as `*`
  (something you can't do with GLOB)

The characters inside the square brackets of a character class form a *set*
of characters where order doesn't matter and repeats don't affect the
meaning. All these below are equivalent and match only one single character
`a` or `z` or `3`:

    grep '[az3]'            # match one single a or z or 3
    grep '[3az]'            # same - order doesn't matter
    grep '[aaazzzz3333]'    # same - bad form - no need to repeat characters

Most Regular Expression special characters lose their meaning when inside
square brackets, but watch out for `^`, `]`, and `-` which do have special
meaning inside square brackets, depending on where they occur.

::: allbox
------------------------------------------------------------------------------
Expression       Matches          Example    Example Matches  Comment
---------------- ---------------- ---------- ---------------- ----------------
character        a SINGLE         `[abc]`    "a" or "b" or    like globbing
classes `[...]`  character from              "c"
                 the list

complement of a  a SINGLE         `[^abc]`   any SINGLE       NOT like GLOB!
character class  character *not*             character not a  GLOB uses ! as
`[^...]`         in the list                 or b or c        in \[!abc\]

special          as if the        `[\]`      `\`              conditions: `]`
character inside character is not                             must be first,
`[...]`          special                                      `^` must not be
                                                              first, and `-`
                                                              must be last
------------------------------------------------------------------------------

: Regular Expressions Character classes `[...]`
:::

Using `^` to complement a character class set: `[^abc]`
-------------------------------------------------------

- The `^` used immediately inside the opening square bracket of a class
  complements the whole character class set: `[^az3]`.
  The resulting character class expression matches any single character
  that is *not* in the set.
    - The complemented class `[^az3]` means "any single character that is
      *not* `a`, `z`, or `3`"
- The `^` only works this way if it is the first character inside the
  square brackets, otherwise it has no special meaning.
    - The classes `[a^z3]` or `[az^3]` or `[az3^]` all match one of `a`,
      `z`, `3`, or `^`
- Remember, a `^` used in a Regular Expression *outside of square brackets*
  has the special meaning "match at beginning of line". Don't confuse it
  with `^` used inside a character class.
- Note that GLOB patterns complement character sets using `!` and not `^`:

        GLOB        Regular Expression
        [!abc]      [^abc]

  Don't confuse GLOB with Regular Expressions.

Having closing `]` as part of a character class set
---------------------------------------------------

A `]` character can be placed inside square brackets to be part of the
character class set, but it has to be the first character in the set.
`[]az3]` means one of the four characters `]`, `a`, `z`, or `3` and
`[^]az3]` means any single character that is *not* one of the four
characters `]`, `a`, `z`, or `3`.

Attempting to put a closing square bracket `]` inside square brackets in any
other position does not work, because the first `]` found closes the class:

- `[ab]d]` is a failed attempt at `[]abd]`
- `[]` is a failed attempt at `[]]`

You can put an opening `[` anywhere in a character class, e.g.

    $ grep '[([{]' doc.txt      # search for lines with '(' or '[' or '{'

POSIX character classes -- e.g. `[:digit:]`
-------------------------------------------

[POSIX Character Class] expressions represent an entire range of characters,
such as "all the digits" or "all the letters". The classes have an awkward
syntax: The POSIX class name is preceded by `[:` and followed by `:]`, e.g.
`[:digit:]`.
These are the resulting class names:

::: allbox
POSIX Class    Description
-------------- --------------------------------------------------------------------
`[:alnum:]`    alphanumeric characters
`[:alpha:]`    alphabetic characters
`[:cntrl:]`    control characters
`[:digit:]`    digit characters
`[:lower:]`    lower case alphabetic characters
`[:print:]`    visible characters, plus \[:space:\]
`[:punct:]`    Punctuation and other symbol characters
`[:space:]`    White space (space, tab, CR, LF) characters
`[:upper:]`    upper case alphabetic characters
`[:xdigit:]`   Hexadecimal digit characters
`[:graph:]`    visible characters (anything except spaces and control characters)
:::

- The exact content of each character class depends on the local language.
- Only for plain ASCII is it true that "letters" means English `a-z` and
  `A-Z`.
- Other languages have other "letters", e.g. `é`, `ç`, etc.
- When we use the POSIX character classes, we are specifying the correct
  set of characters for the local language as per the POSIX description.

These POSIX class names only work inside an enclosing Regular Expression
character class expression using (more) square brackets. What looks like
double square brackets is really an enclosing square bracket character class
expression containing a POSIX class name (which unfortunately also uses
square brackets and colons as part of its name), e.g.

    grep '[0123456789]'         # a digit (a list of all the digits)
    grep '[[:digit:]]'          # a digit - the POSIX class name [:digit:] inside []
    grep '[abcd[:digit:]]'      # a digit or letter a or b or c or d
    grep '[ab[:digit:]cd]'      # same -- a digit or a or b or c or d
    grep '[[:digit:]abcd]'      # same -- a digit or a or b or c or d

Of course you can use multiple POSIX class names inside the character class
expression:

    grep '[[:alpha:][:digit:]]'     # a letter or a digit
    grep '[^[:alpha:][:digit:]]'    # *NOT* a letter or a digit

**WARNING:** You cannot interchange the `[:alpha:]` class and a list of all
the upper- and lower-case letters; they are not always the same because the
POSIX `[:alpha:]` class changes depending on the local language:

    grep '[[:alpha:]]'      # a letter, using the POSIX class name [:alpha:]
    grep '[a-zA-Z]'         # NOT THE SAME AS [:alpha:] - DO NOT USE !
    grep '[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]'  # NOT THE SAME

POSIX Regular Expression Examples, e.g. `^[[:digit:]]*$`
--------------------------------------------------------

These expressions could be given to `grep`:

- Any line containing nothing or only alphabetic characters from start to
  end:

        ^[[:alpha:]]*$

- Any line containing only alphabetic characters from start to end, but
  must have at least one such character (can't be an empty line):

        ^[[:alpha:]][[:alpha:]]*$

- Any line that begins with a digit (followed by anything or nothing):

        ^[[:digit:]]

Character class ranges using `[.-.]`
------------------------------------

- A dash `-` between two characters inside square brackets, e.g. `[0-9]`,
  represents a range of characters between the two, unless the dash is
  first or last in the set of characters, e.g. `[-09]` or `[09-]`
    - Both `[-09]` and `[09-]` mean one of the three characters `0`, `9`,
      or `-`
- The range expression `[0-9]` means any one character from the set of
  characters located between characters `0` and `9` inclusive. What
  determines which characters lie *between* other characters?
  The result depends on your current [Locale] and is not well-defined.

Do not use Alphabetic Ranges that depend on Locale, e.g. `[a-z]`
----------------------------------------------------------------

Do not use alphabetic ranges (e.g. `[a-z]`)! The ranges change depending on
your system [Locale] and may change in unexpected ways:

    $ touch A B C Z a b c z

    $ LC_ALL=C
    $ echo *
    A B C Z a b c z
    $ echo [a-z]
    a b c z

    $ LC_ALL=en_CA.UTF-8
    $ echo *
    A a B b C c Z z
    $ echo [a-z]
    a B b C c Z z

- The range `[a-z]` meaning "any one character between `a` and `z`
  inclusive" used to mean something when there was only one ASCII English
  locale.
- Now that multiple locales exist, the meaning of "between `a` and `z`
  inclusive" is ambiguous because it means different things in different
  locales.
- Always use the POSIX character classes for complete letter ranges, e.g.
  `[:alpha:]`.
- If you need to specify a partial list of characters, enumerate the list;
  do not use a range: use `[abcdefgh]` not `[a-h]`

Extended Regular Expressions: `?` `+` `|` `{` `}` `(` `)`
=========================================================

Some features of Regular Expressions are called **Extended** features. These
features are described below and use more special characters:
`?` `+` `|` `{` `}` `(` `)`

Basic versus Extended Regular Expression syntax: `\|` vs. `|`
-------------------------------------------------------------

The difference between Basic and Extended Regular expressions is whether the
program requires you to use a backslash to make use of the Extended
features:

    Basic:     .  *  ^  $  \   \|  \?  \+  \{  \}  \(  \)   # must use backslash
    Extended:  .  *  ^  $  \   |   ?   +   {   }   (   )    # do *NOT* use backslash

The ordinary `grep` program uses Basic Regular Expressions, so you have to
use backslashes in front of the Extended characters to turn on Extended
features.
The `egrep` Extended Regular Expression program (short for `grep -E`)
doesn't need the backslashes:

    $ grep 'Accepted publickey for \(idallen\|cst8207[abc]\?\)' /var/log/auth.log
    $ egrep 'Accepted publickey for (idallen|cst8207[abc]?)' /var/log/auth.log

Basic Regular Expressions are used in these programs and you need to use
backslashes to turn on Extended features:

- `vi`, `more`, `sed`, `grep`

Extended Regular Expressions are used in these programs and you do *not*
need backslashes to enable Extended features:

- `less` (e.g. `man` pages)
- `awk`
- `egrep` and `grep -E`
- `perl` and `grep -P`

> The `perl` program (and `grep -P`) has its own set of special
> Perl-compatible Regular Expression features, not described here.

Extended Feature: Repetition: `?` `+` `{n,m}`
---------------------------------------------

Extended Regular Expressions give you more options when repeating a
preceding expression:

::: allbox
-----------------------------------------------------------------------------
Basic                     Extended                  Repetition Meaning
------------------------- ------------------------- -------------------------
`*`                       `*`                       zero or more times

`\?`                      `?`                       zero or one time

`\+`                      `+`                       one or more times

`\{n\}`                   `{n}`                     n times, n is an integer

`\{n,\}`                  `{n,}`                    n or more times, n is an
                                                    integer

`\{,m\}`                  `{,m}`                    m or fewer times, m is an
                                                    integer (GNU extension)

`\{n,m\}`                 `{n,m}`                   at least n, at most m
                                                    times, n and m are
                                                    integers
-----------------------------------------------------------------------------

: Regular Expressions -- repeat preceding (Repetition)

Examples:

    $ egrep 'colou?r' doc.txt                       # color or colour not colouur
    $ egrep 'has +spaces' doc.txt                   # one or more spaces between
    $ egrep '[0-9]{9}' doc.txt                      # 123456789
    $ egrep '[0-9]{3}-[0-9]{3}-[0-9]{4}' doc.txt    # 123-456-7890
    $ egrep '^.{80}$' doc.txt                       # 80 character lines
    $ grep '^.\{,80\}$' doc.txt                     # 80 character or fewer lines

Note that the `{,m}` capability is not available in all Extended Regular
Expressions, since it is a [GNU] extension.
:::

Extended Feature: Alternation (one *or* the other): `ab|cd`
-----------------------------------------------------------

Extended Regular Expressions give you a way of matching one expression
**or** another expression using the logical **or** bar `|` operator:

    $ grep -E 'dog|cat' doc.txt             # find lines with dog or cat
    $ grep 'dog house\|cat fight' doc.txt   # find lines with "dog house" or "cat fight"

You can do a crude form of alternation using the `-e` option to give the
alternatives (as many as you like) in the `grep` family of programs:

    $ fgrep -e 'dog' -e 'cat' doc.txt       # find lines containing dog or cat
    $ grep -e '^dog$' -e '^cat$' doc.txt    # find lines with *only* dog or cat

The **or** `|` operator binds very loosely. Everything else has higher
precedence:

    $ grep -E '^a|b$' doc.txt   # lines starting with a or ending with b

Extended Feature: Grouping with parentheses `a(b|c)d`
-----------------------------------------------------

Parentheses `(` and `)` are an Extended feature that can be used to group
Regular Expressions for repetition, and to override the precedence rules.

    $ egrep 'ab|cd' doc.txt         # ab or cd
    $ egrep 'a(b|c)d' doc.txt       # a followed by "b or c" followed by d
    $ grep -E '^a|b$' doc.txt       # lines starting with a or ending with b
    $ grep -E '^(a|b)$' doc.txt     # lines containing only a or only b
    $ egrep 'Bar(bar)+a' doc.txt    # Barbara, Barbarbara, etc.

(Visit Barbara at the [Rhababer-Barbara-Bar].)

Extended Feature: Tags or Backreferences; `\1` `\2` `\3`
--------------------------------------------------------

Another extended regular expression feature allows you to match later what
matched earlier in a pattern:

- When you use parentheses for grouping, you can refer to the `n`'th group
  using `\n` (backslash followed by the number `n`).
- The pattern `(..)\1` means any sequence of two characters that repeats,
  e.g. `abab` or `1-1-` or `XyXy`, etc.
- The `\1` in the above example refers backward to the first group
  (parenthesized) expression.
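
The backreference pattern above can be tried on a few sample words. This is
a sketch with hypothetical input; it assumes a `grep` (such as GNU `grep`)
that accepts backreferences in both Basic and Extended syntax, and it
anchors the pattern so that only whole-line repeats are selected.

```shell
# Hypothetical sample words, one per line.
words='abab
1-1-
XyXy
abcd'

# Extended syntax: group two characters, then require the same two again.
repeats_ere=$(printf '%s\n' "$words" | grep -E '^(..)\1$')

# Basic syntax: the same pattern with backslashed parentheses.
repeats_bre=$(printf '%s\n' "$words" | grep '^\(..\)\1$')

printf '%s\n' "$repeats_ere"
```

The line `abcd` is not selected because `cd` does not repeat the text that
the group `(..)` matched (`ab`).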

Regular Expression Precedence
=============================

- Repetition binds the tightest (think exponentiation).
- Concatenation is next tightest (think multiplication).
- Alternation has the loosest or lowest precedence (think addition).

        $ grep 'ab*' doc.txt        # matches a followed by zero or more b
        $ grep -E 'ab|cd' doc.txt   # matches ab or cd

As in mathematics, Regular Expression precedence can be overridden with
explicit parentheses to do grouping.

::: allbox
+------------------------+------------------------+------------------------+
| Operation              | Regex                  | Algebra                |
+========================+========================+========================+
| grouping               | () or \\(\\)           | parentheses            |
|                        |                        |                        |
|                        |                        | brackets               |
+------------------------+------------------------+------------------------+
| repetition             | * or ? or + or {n} or  | exponentiation         |
|                        | {n,} or {n,m}          |                        |
|                        |                        |                        |
|                        | * or \\? or \\+ or     |                        |
|                        | \\{n\\} or \\{n,\\} or |                        |
|                        | \\{n,m\\}              |                        |
+------------------------+------------------------+------------------------+
| concatenation          | ab                     | multiplication or      |
|                        |                        | division               |
+------------------------+------------------------+------------------------+
| alternation            | \| or \\\|             | addition or            |
|                        |                        | subtraction            |
+------------------------+------------------------+------------------------+

: Precedence rules summary (BEDMAS for Regexp)
:::

Backslash to remove regexp meaning of a meta-character: `\.`
============================================================

To remove the Regular Expression meaning of any Regular Expression
meta-character, put a backslash in front of it. This applies to both Basic
and Extended Regular Expressions.
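
The effect of the backslash can be sketched with three hypothetical sample
lines, only one of which contains a literal `a.b`:

```shell
# Hypothetical sample lines.
sample='a.b
axb
a*b'

# Unescaped: "." matches any single character, so all three lines match.
any_char=$(printf '%s\n' "$sample" | grep -c 'a.b')

# Escaped: "\." matches only a literal period.
literal_dot=$(printf '%s\n' "$sample" | grep -c 'a\.b')

# Escaped: "\*" matches only a literal asterisk.
literal_star=$(printf '%s\n' "$sample" | grep 'a\*b')

printf '%s\n' "unescaped dot: $any_char lines; escaped dot: $literal_dot line"
printf '%s\n' "escaped star selected: $literal_star"
```

Single quotes around each pattern keep the shell from touching the
backslashes before `grep` sees them.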

In all types of Regular Expressions:

- backslash `\*` matches a literal asterisk
- backslash `\.` matches a literal period
- backslash `\\` matches a literal backslash
- backslash `\$` matches a literal dollar sign
- backslash `\^` matches a literal circumflex

In Extended Regular Expressions, you need more backslashes to hide the
additional Extended Regular Expression meta-characters, e.g. `\+` hides the
meaning of `+` and matches a real plus sign in an Extended Regular
Expression, just as `\?` matches a real question mark:

    $ egrep 'foo\++' doc.txt    # match one or more plus signs (Extended)
    $ grep 'foo+\+' doc.txt     # match one or more plus signs (Basic)
    $ egrep 'foo\??' doc.txt    # match an optional question mark (Extended)
    $ grep 'foo?\?' doc.txt     # match an optional question mark (Basic)

Regular Expression Traps and Pitfalls
=====================================

POSIX character class names are indivisible
-------------------------------------------

The POSIX class name includes the surrounding colons and square brackets and
nothing should ever be placed inside those brackets. This is a common
mistake:

    grep '[[^:digit:]]'     # WRONG ! no longer a POSIX class name !
    grep '[^[:digit:]]'     # correct - match any single non-digit character

Using what you think is a POSIX character class outside of the enclosing
character class square brackets does not work. On some systems, `grep` will
warn you that it doesn't work:

    $ grep '[:alnum:]'      # WRONG !
    grep: character class syntax is [[:space:]], not [:space:]

On other systems, the character class expression will quietly match the list
of characters inside the outer square brackets, i.e. match one of the
characters `:`, `a`, `l`, `n`, `u`, or `m`!

Regexp matches are as long as possible
--------------------------------------

Any Regular Expression match will be as long as possible. Such matches are
called "greedy":

- `a.*c` matches all of `abc___abc` -- it doesn't only match the first
  `abc`.
-   You can turn off this *greedy* behaviour in some implementations of
    Regular Expressions. (See the `perl` expression `*?`, also available as
    `grep -P`.)

Don't use repeat operators at line boundaries in `grep`
-------------------------------------------------------

All the expressions below match the same set of lines containing a letter
`a`, but the first expression uses a lot less processing power than the
others:

    $ grep 'a' file.txt       # this is the cleanest and fastest one
    $ grep 'aa*' file.txt
    $ grep 'a.*' file.txt
    $ grep '.*a' file.txt
    $ grep '.*a.*' file.txt

If you're looking for lines containing a piece of text, don't complicate
the regexp with repeat operators that waste computer time but don't change
which lines the regexp finds.

Unix/Linux regex processing is line based
-----------------------------------------

-   Linux text files are usually processed line by line when matching
    Regular Expressions; regular expressions will not cross line boundaries.
-   The newlines at the end of every line are not usually considered part
    of the text that can be matched.
-   The `vim` editor is an exception -- it has a special syntax for matching
    across line ends, e.g. `abc\ndef` and `abc\_.def`, but this doesn't work
    anywhere else (so don't worry about it here).

Regular Expressions match anywhere in a line: anchoring with `^` and `$`
------------------------------------------------------------------------

Unlike GLOB Patterns, which are anchored, Regular Expressions are not
anchored unless you make them so using the explicit anchor characters `^`
and/or `$`. Unanchored Regular Expressions "float" down the string until a
match is found, and they don't have to extend to the end of the string.

    $ echo a*b               # anchored: matches axb not abx or xab
    $ ls | grep '^a.*b$'     # equivalent anchored Regular Expression
    $ ls | grep 'a.*b'       # NOT equivalent unanchored Regular Expression

Regular Expressions "float" down the string unless they are anchored.
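The difference between anchored and floating matches can be sketched with a few made-up names. The four strings below are hypothetical stand-ins for the output of `ls`, chosen only to show which patterns accept them.

```shell
# Hypothetical names standing in for the output of ls.
names=$(printf '%s\n' 'axb' 'abx' 'xab' 'ab')

# Anchored at both ends: the whole line must start with a and end with b.
anchored=$(printf '%s\n' "$names" | grep -c '^a.*b$')

# Unanchored: any line containing a followed somewhere later by b matches.
floating=$(printf '%s\n' "$names" | grep -c 'a.*b')

printf 'anchored matched %s of 4 names\n' "$anchored"
printf 'floating matched %s of 4 names\n' "$floating"
```

The anchored pattern accepts only `axb` and `ab`, the same names the GLOB `a*b` would match; the unanchored pattern accepts all four.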
Expressions matching zero length strings match everywhere
---------------------------------------------------------

-   The repetition operator `*` means "zero or more".
-   A Regular Expression consisting of zero of anything can match anywhere
    and everywhere between all the characters in a line.
-   For example, if you have a line with any 10 characters in it, the
    zero-length Regular Expression `x*` (meaning zero or more `x`
    characters) could match 11 times, before and after every one of the 10
    characters (if it doesn't match any of the characters themselves):

        $ echo '0123456789' | sed -e 's/x*/-/g'
        -0-1-2-3-4-5-6-7-8-9-

-   The `grep` colour option and web-based regexp tools cannot highlight
    matches of zero characters, but the matches are there!
-   The `vim` editor will highlight the entire line when a zero-length
    expression matches between all the characters.

Quote all regexp to hide them from the shell
--------------------------------------------

The Regular Expression below sometimes works, and sometimes does not,
depending on what file names match the `aa*` GLOB pattern in the current
directory:

    grep aa* foo.txt     # no quotes, GLOB expands: bad idea

-   The shell will try to do filename globbing on `aa*`, possibly changing
    it into existing filenames that begin with `a` before `grep` runs: we
    don't want that.
-   Use shell quoting around Regular Expressions; don't let the shell GLOB
    expansion change the regex before `grep` sees the regex.

Alphabetic ranges are not well-defined in all Locales
-----------------------------------------------------

-   Do not use alphabetic dashed range expressions, e.g. `[a-m]`; in many
    common locales they do not match what you think they match (only
    lower-case letters).
-   Use the POSIX character classes to match a whole class of characters,
    e.g. `[[:lower:]]`
-   Specify all the characters in a partial range, e.g. use
    `[abcdefghijklm]` not `[a-m]`
-   Numeric ranges such as `[0-9]` are accepted.
    (Are there any locales where `[0-9]` does not mean the ten digits zero
    through nine?)
-   You can also explicitly set your Locale to be the old ASCII Locale,
    which makes alphabetic ranges safe to use: `LC_ALL=C`

Regular Expressions in programs: `vi` `sed` `less`
==================================================

`vi` reference:

You can search and replace in `vi` using a Basic Regular Expression in a
Substitution line command. The substitution command by default uses slashes
to delimit the text to match and the replacement text:

    :%s/colou\?r/COLOUR/g     # make all color and colour upper-case

The program `sed` (Stream EDitor) can apply a Basic Regular Expression
substitution non-interactively by reading a file (or standard input) and
writing to standard output:

    $ sed -e 's/colou\?r/COLOUR/g' input.txt >output.txt

You can search using Regular Expressions in the interactive programs `vi`,
`more`, and `less` (and also `man`, which uses `less`) by typing a slash
followed by the Regular Expression to search for:

    /^ *read     # find "read" at the start of a line, after optional spaces

(Remember that `vi` and `more` use Basic Regular Expressions and `less`
uses Extended Regular Expressions.)

Example: capitalize sentences repeatedly in a document using `vi`
-----------------------------------------------------------------

Task: Any lower-case letter following a period and two spaces should be
made upper-case. Easy to do using Regular Expressions in `vi`:

-   To search forward in `vi`, type: `/\.  [[:lower:]]`
-   Then type `4~` to toggle the case of the four matched characters; the
    period and the two spaces are unaffected, so only the letter changes to
    upper-case.
-   Then type `n` (next match) and `.` (repeat change) as many times as
    necessary.
-   The `n` command moves to the next occurrence, and `.` repeats the
    capitalization command.

Example: uncapitalize in middle of words
----------------------------------------

Any upper-case character following a lower-case character should be made
lower-case, e.g.
`uNcapitalize` or `aWkward` or `iN`

-   to search forward in `vi`, type: `/[[:lower:]][[:upper:]]`
-   then type `l` to move one to the right (off of the lower-case letter)
-   type `~` to change the capitalization
-   type `nl.` as necessary
-   the `l` is needed because vi will position the cursor on the first
    character of the match, which in this case is a character that doesn't
    change.

> **Advanced:** In `vim` you can also use the syntax
> `/[[:lower:]][[:upper:]]/b1` to both match the text and move the
> cursor right one position. Then you can just repeat the two characters `n.`
> as many times as necessary. The `vim` editor has very advanced pattern
> search and cursor position capabilities; type `:help regexp`

Regular Expression Resources
============================

-   Some students are already comfortable with the command line
-   For those who aren't, yet another tutorial source that might help is
    Lynda.com
-   All Algonquin students have free access to Lynda.com
-   Unix for Mac OSX users:
-   Lynda.com has a course on regular expressions. The problem is that it
    covers our material as well as some more advanced topics that we won't
    cover. It is a good presentation, and the following chapters should
    have minimal references to the "too advanced" material:
    -   Chapter 2 Characters
    -   Chapter 3 Character Sets
    -   Chapter 4 Repetition Expressions

Interactive Regular Expression Tutorial
---------------------------------------

There is a quick interactive tutorial on Regular Expressions, but be aware
that this tutorial uses some short-hand expressions that we don't use in
this course because they don't work everywhere:

::: allbox
  Shortcut   POSIX Character Class
  ---------- ----------------------------
  `\w`       similar to `[[:alnum:]_]`
  `\W`       similar to `[^[:alnum:]_]`
  `\s`       similar to `[[:space:]]`
  `\S`       similar to `[^[:space:]]`
:::

The tutorial does not use or understand the POSIX character classes that
are more standard in Unix/Linux programs.

-- 
| Ian! D. Allen - idallen@idallen.ca - Ottawa, Ontario, Canada
| Home Page: http://idallen.com/ Contact Improv: http://contactimprov.ca/
| College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/
| Defend digital freedom: http://eff.org/ and have fun: http://fools.ca/

[Plain Text] - plain text version of this page in [Pandoc Markdown] format

  [www.idallen.com]: http://www.idallen.com/
  [Course Home Page]: ..
  [Course Outline]: course_outline.pdf
  [All Weeks]: indexcgi.cgi
  [Plain Text]: 800_regular_expressions.txt
  [Regular Expressions]: http://en.wikipedia.org/wiki/Regular_expression
  [POSIX Character Class]: http://en.wikipedia.org/wiki/Regular_expression#Character_classes
  [Locale]: /cst8177/15w/notes/000_character_sets.html
  [GNU]: http://gnu.org/
  [Rhababer-Barbara-Bar]: http://www.youtube.com/watch?v=2bim74DR9rI
  [Pandoc Markdown]: http://johnmacfarlane.net/pandoc/