===========================================================
Regular Expressions - notes on Basic, Extended, and OddBall
===========================================================
-Ian! D. Allen  idallen@idallen.ca

Ignore the confusing mess at the start of Basic and Extended Regular
Expressions in your Quigley text. Here's what you need to know about
Basic and Extended metacharacters from that section.

WARNING: Beyond the basic regular expression metacharacters that work
everywhere, Unix/Linux is a mess of compatibility differences.
Experiment before you use anything in the extended or "oddball"
metacharacter set, below.

-------------------------
Basic Regular Expressions - Must Know
-------------------------

The basic regular expression metacharacters that work everywhere are
these:

    ^ $ . * [ ]

The "^" is a metacharacter only when it appears at the start (left side)
of a regular expression, e.g. /.^/ matches any one character followed by
a real circumflex.

The "$" is a metacharacter only when it appears at the end (right side)
of a regular expression, e.g. /$./ matches a real dollar sign followed
by any one character.

Inside [] you can use a leading "^" to complement (invert) the list of
characters matched, e.g. [^abc], and you can use "-" between characters
to indicate a range, e.g. [a-z0-9]. The other metacharacters have no
meaning when used inside [], e.g. [.*$] matches a single character that
is either a period, an asterisk, or a dollar sign. The "^" has no
special meaning when not used immediately after the opening bracket,
e.g. [a^] matches a single character that is a letter "a" or a
circumflex "^".

The above metacharacters work in everything on all versions of
Unix/Linux that understand regular expressions, including grep, sed, vi,
ed, Perl, and awk/gawk. They never need leading backslashes to work as
regular expression metacharacters. Use them with confidence, everywhere.
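
The rules for the basic set above can be tried out with plain grep. This
is only a minimal sketch, assuming a POSIX shell; the sample lines and
the variable names are made up for this example:

```shell
# Made-up sample input: one candidate string per line.
sample='abc
a.c
a^b
$x
cat
xyz'

# "^" and "$" anchor at the two ends; "." matches any one character.
anchored=$(printf '%s\n' "$sample" | grep '^a.c$')

# In the middle of the pattern, "^" is a real circumflex.
literal_caret=$(printf '%s\n' "$sample" | grep 'a^b')

# At the start of the pattern, "$" is a real dollar sign.
literal_dollar=$(printf '%s\n' "$sample" | grep '$x')

# Inside [], the characters ".", "*" and "$" are ordinary characters.
in_brackets=$(printf '%s\n' "$sample" | grep '[.*$]')

# A leading "^" inside [] inverts the list: lines whose first
# character is not a, c, or x.
inverted=$(printf '%s\n' "$sample" | grep '^[^acx]')

# A "-" between characters gives a range: lines starting with x, y, or z.
in_range=$(printf '%s\n' "$sample" | grep '^[x-z]')

printf '%s\n' "$anchored" "$literal_caret" "$literal_dollar" \
    "$in_brackets" "$inverted" "$in_range"
```

Always put the pattern in single quotes, so that the shell does not try
to expand "$x" or "*" before grep sees them.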

----------------------------
Extended Regular Expressions - Must Know
----------------------------

The "extended" regular expression metacharacters are these:

    ? + | ( )

The above extended metacharacters were first understood by "egrep" and
also work, without backslashes in front, in some other programs (e.g.
Perl, awk). Most other programs (e.g. sed, vi, grep) require the
extended metacharacters to be backslashed to work as metacharacters,
e.g. \?, \(, \+, etc.

Not all programs handle all of the extended regular expression
characters, and not all programs handle all the features that they
provide. Use the above extended metacharacters with caution in versions
of programs that accept extended regular expressions:

    egrep - no backslashes needed - all characters work
    perl  - no backslashes needed - all characters work
    awk   - no backslashes needed - all characters work
    sed   - must precede with backslashes; some may not work
    vi    - must precede with backslashes; some may not work
    ed    - must precede with backslashes; some may not work

Warning: Not all versions of sed, vi, and ed will accept extended
regular expressions, whether backslashed or not. Experiment first.

Note: Sometimes the * and + metacharacters can be used to repeat an
entire parenthesized expression, e.g. (abc)* matches abcabcabc; however,
this isn't always true. (Solaris sed fails here - the * only repeats the
previous character, not the entire group.) Experiment first.

---------------
Back-References - Must Know
---------------

The metacharacter back-references \1, \2, etc. (or $1, $2, etc. in
Perl), only work in programs that understand "extended" regular
expressions (you need working parentheses); but, they do not always work
as part of the regular expression itself.
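
Since back-references depend on working parentheses, it helps to confirm
first how the extended metacharacters behave on your system. Here is a
minimal sketch, assuming a POSIX shell and a grep that accepts the -E
option (the standard spelling of egrep); the sample words and variable
names are made up for this example:

```shell
# Made-up sample input: one candidate string per line.
sample='color
colour
colouur
cat
dog
abcabc
abab'

# "?" makes the preceding character optional.
optional=$(printf '%s\n' "$sample" | grep -E '^colou?r$')

# "+" means one or more of the preceding character.
one_or_more=$(printf '%s\n' "$sample" | grep -E '^colou+r$')

# "|" separates alternatives; "( )" groups them.
either=$(printf '%s\n' "$sample" | grep -E '^(cat|dog)$')

# A repetition operator after ")" should repeat the whole group.
# An empty result here would mean only one character is repeated.
group_repeat=$(printf '%s\n' "$sample" | grep -E '^(abc)+$')

printf '%s\n' "$optional" "$one_or_more" "$either" "$group_repeat"
```

If any of these comes back empty on your system, that metacharacter (or
group repetition) does not work there the way egrep defines it.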

The regular expression pattern below (find repeated characters) contains
an embedded back-reference, and it only works in some versions of some
programs that handle extended regular expressions:

    (.)\1

Experiment before you use back-references in the regular expression
itself.

Some programs will not accept back-references in the regular expression
itself; but, you can always use them on the right-hand side (the
replacement text half) of a "substitute" command:

    s/([a-z]+)([0-9]+)/\2\1/

The above pattern (swap alphabetics and numerics) with back-references
always works in programs that handle extended regular expressions and
"substitute" commands. (In Perl, write the back-references in the
replacement text with dollar signs instead of backslashes: $2$1.)

In sed, vi, and ed the extended regular expression metacharacters must
be preceded by backslashes to turn them "on":

    s/\([a-z]\+\)\([0-9]\+\)/\2\1/

Don't backslash the backslashes in the replacement text!

Summary: Back-references don't always work when embedded in the regular
expression half; they do work in the replacement text half.

---------------------------
Oddball Regular Expressions - May Know
---------------------------

These characters and sequences were added to specific programs and have
been gradually appearing in other programs over the years:

ODDBALL: The metacharacters \< and \> originated in Berkeley vi and also
work in some other programs. They always appear as two characters, with
a backslash first. Not all programs understand them. Experiment before
you use them.

ODDBALL: The metacharacter braces { and } work in a few versions of some
regular expression programs. Most programs (e.g. ed, vim, GNU grep)
require the metacharacters to be backslashed to work, e.g. \{, \}. Not
all programs understand these braces. Experiment before you use them.

ODDBALL: Perl introduced some additional escape sequences for regular
expressions, and these have caught on elsewhere in the GNU/Linux world.
Some GNU/Linux regular expression programs (e.g. vim) understand these
Perl-isms:

    \w \W \s \S \d \D \a \A \b etc.

Most programs do not understand some or all of these Perl-like escape
sequences. Experiment before you use them.

Avoid using these oddball characters in scripts, if you want the scripts
to work across different Unix versions.
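
The "experiment first" advice for back-references and the oddball set
can be put into a small probe script. This is only a sketch, assuming a
POSIX shell; "probe" and "run_probes" are hypothetical helpers written
for this example, and the -E option to sed is assumed to be available
(it switches sed to unbackslashed extended metacharacters):

```shell
# probe LABEL INPUT EXPECTED COMMAND...
# Feeds INPUT to COMMAND and reports whether the output equals EXPECTED.
probe() {
    label=$1
    input=$2
    expected=$3
    shift 3
    # A command that exits with an error yields an empty result.
    actual=$(printf '%s\n' "$input" | "$@" 2>/dev/null) || actual=''
    if [ "$actual" = "$expected" ]; then
        printf 'works: %s\n' "$label"
    else
        printf 'FAILS: %s\n' "$label"
    fi
}

run_probes() {
    # Back-reference embedded in the pattern itself (doubled character).
    probe 'embedded back-reference in grep' 'hello' 'hello' \
        grep '\(.\)\1'
    # Back-references in the replacement text (swap letters and digits).
    probe 'replacement back-reference in sed -E' 'abc123' '123abc' \
        sed -E 's/([a-z]+)([0-9]+)/\2\1/'
    # Backslashed extended metacharacters in plain sed.
    probe 'backslashed plus in plain sed' 'abc123' '123abc' \
        sed 's/\([a-z]\+\)\([0-9]\+\)/\2\1/'
    # Oddball: word anchors.
    probe 'word anchors in grep' 'a cat sat' 'a cat sat' \
        grep '\<cat\>'
    # Oddball: backslashed braces as a repeat count.
    probe 'backslashed braces in grep' 'aab' 'aab' \
        grep 'a\{2\}b'
    # Oddball: Perl-style word-character escape.
    probe 'Perl-style word escape in grep' 'x' 'x' \
        grep '^\w$'
}

report=$(run_probes)
printf '%s\n' "$report"
```

On a system with GNU tools you would typically see "works" on every
line; on other systems some probes may report FAILS, and those are the
features to keep out of portable scripts.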