------------------------------------------------
Linux Shells by Example: Chapters 5-7 Reading Guide
------------------------------------------------
-IAN! idallen@ncf.ca

Here is a reading guide and some review questions for Chapters 5-7:
    "the gawk Utility: gawk as a Linux Tool"
    "the gawk Utility: Evaluating Expressions"
    "the gawk Utility: gawk Programming"

Remember to read the text_errata.txt file (under Notes) and correct all
the mistakes in these chapters before you read them.

The data files for the examples in the textbook are under these directories:
    /home/alleni/cst/cdrom/chap05/
    /home/alleni/cst/cdrom/chap06/
    /home/alleni/cst/cdrom/chap07/

Useful additional notes to read:
    regular_expressions.txt  (Basic vs. Extended Regular Expressions)

Introduction to awk:

    Before the appearance of the Perl language, awk and its successors
    (nawk and gawk) were useful programs for processing files of data and
    manipulating the data.  Perl (which took the best parts of awk and
    added pieces of sed, tr, and the C Library) has largely replaced awk
    for serious file processing.  Perl has been ported to many different
    architectures (including Macintosh and Windows), which now makes
    it more common than awk.  In this course we will only touch on the
    very basics of how awk works, so that you can use awk for simple
    "data mining" in shell scripts.  We will not be studying the full
    programming language features of awk/gawk.

What is awk good for?

    We use awk most commonly for simple extractions and manipulations
    of tabular data, things that awk can do more concisely than Perl.

    Example:  Process the password file and print the line number and
    userid of all lines where the numeric uid plus gid is less than 50.

    $ awk -F: '($3 + $4) < 50 { print NR, $1 }' /etc/passwd

    Awk does this in one line.  Perl also does it in one line:

    $ perl -F: -ane 'print "$. $F[0]\n" if ($F[2] + $F[3]) < 50' /etc/passwd

    The Perl line is longer, harder to type, and has to have all the
    array indexes shifted down by 1.  (Awk labels fields starting with
    $1, the way people count; Perl numbers its arrays starting at zero.)

    Example:  Sum the second column of all input lines and print the
    total and average value (handle a file with zero records correctly):

    $ awk '
            { sum += $2; ++count; }   # this line applies to all input lines
    END     {
                    print "total", sum
                    if ( count == 0 ) count = 1
                    print "average", sum/count
            }' inputfile

    While Perl can also do this (and much, much more), the Perl code
    to do it is somewhat more verbose (and the array index looks wrong,
    but isn't):

    $ perl -ane '
            $sum += $F[1]; ++$count;
    END     {
                    print "total $sum\n";
                    $count = 1 if $count == 0;
                    print "average ", $sum / $count, "\n";
            }' inputfile

    Perl takes just a bit more "syntax" to do the same thing.
    
    The economy of expression in awk is what makes awk popular for simple
    data manipulation in shell scripts.

*)  What to study in Chapters 5, 6, and 7:

    5.2 intro YES
    5.2.1 YES
    5.2.2 YES
    5.2.3 only the -F option (set input field separator character)
    5.3.1 YES
    5.3.2 NO
    5.3.3 NO
    5.4   NO
    5.5.1 YES (know the use of $0 and NR inside awk)
    5.5.2 YES
    5.5.3 only single-character input field separator (FS) (no regexp)
    5.6.1 YES
    5.6.2 YES
    5.7 intro YES (basic and extended regexp characters only)
    5.7.1 YES
    5.8   NO
    5.9.1 YES
    5.9.2 YES
    5.9.3 YES
    5.9.4 YES (not Example 5.55)
    5.9.5 NO
    Linux Tools Lab 3 - questions 1 through 9

    6.1 intro YES
    6.1.1 YES (most are same as C language)
    6.1.2 NO
    6.1.3 YES (most are same as C language)
    6.1.4 YES (same as C language)
    6.1.5 YES (same as sed)
    6.1.6 NO
    6.2.1 YES
    6.2.2 YES
    6.2.3 YES
    6.2.4 YES
    6.2.5 YES
    6.2.6 YES
    6.2.7 NO
    6.2.8 YES
    Linux Tools Lab 4 - all questions

    7.1.1 YES
    7.1.2 YES (only know these built in variables: NF, NR) p.184
    7.1.3 NO
    7.1.4 NO
    7.2   NO
    7.3   NO
    7.4   NO
    7.5.1 YES
    7.5.2 YES (only variables NR and NF; using option "-F:")
    7.5.3 NO
    7.5.4 NO
    7.5.5 NO
    7.5.6 NO
    7.5.7 NO
    7.5.8 NO
    7.6   NO
    7.7   NO
    7.8   NO
    7.9   NO
    7.10  NO
    7.11  NO
    7.12  NO
    7.13  NO
    7.14  NO
    Linux Tools Lab 5,6,7 - no questions apply

*)  An awk program is a list of pattern and action pairs (p.127).
    What is the default action, if no action is given after a pattern?
    What is the default pattern, if no pattern is given before an action?
    What happens if both pattern and action are missing?

*)  If you do not give awk any file names to process, what happens?  (p.129)
    (Hint: the same thing happens with head, tail, cat, sort, etc.)

*)  All versions of awk can use the -F option to set a different input
    field separator.  (The default separator is whitespace.)  Only
    recent versions of awk accept multiple characters or a regular
    expression - be careful.
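
    A minimal sketch of -F with a single-character separator.  The
    comma-separated sample data is invented for illustration; it is not
    one of the textbook data files.

```shell
# Invented sample data: name,department,extension
data='alice,sales,1201
bob,support,1305'

# Without -F, each line has no whitespace and so is a single field.
no_F=$(printf '%s\n' "$data" | awk '{ print NF }')

# With -F, each line splits into three fields at the commas.
with_F=$(printf '%s\n' "$data" | awk -F, '{ print $1, $3 }')

printf '%s\n' "$no_F" "$with_F"
```

    The first command reports one field per line; the second prints the
    name and extension from each line.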

*)  Use the -F option to print only the userid fields from the Unix
    password file.  (see section 5.5.3 for similar examples)

*)  True or False: The following awk lines produce identical output:

    $ awk -F: 'NR <= 10 { print "Userid", $1, "Shell", $7 }' /etc/passwd
    $ awk -F: 'NR <= 10 { print "Userid " $1 " Shell " $7 }' /etc/passwd

    True - each "," between print items puts a single space between
    them in the output.  Without commas, strings and fields separated
    only by blanks are concatenated into one big string; if you want
    blanks, you must add them yourself.  (p.132)

*)  By default, awk recognizes records in a file as being separated by
    newline characters.  What built-in awk variable counts the number
    of the current input record (current line)?  (p.139)

*)  True or False: The variable $0 stands for the current program name
    when used inside an awk program. (p.138,139)

*)  By default, awk recognizes fields in a file as being separated by
    any amount of whitespace (blanks or tabs).  What built-in awk variable
    counts the number of fields in the current line?  (p.139)
    
*)  What command-line option tells awk to use some other character to
    separate fields in input lines?  (p.140,141)

*)  What is the output of these command lines?

    $ awk -F: 'NR <= 10 { print "Userid", $1, "Shell", $NF }' /etc/passwd
    $ awk -F: 'NR <= 10 { print "Userid", $1, "Shell", $(NF-1) }' /etc/passwd
    $ awk -F: 'NR <= 10 { print "Userid", $1, "Shell", $(NF-2) }' /etc/passwd

    (Note: The syntax $NF only works in awk - don't try this to refer
    to positional parameters inside a shell script; it won't work!)

*)  By default, awk discards leading whitespace when looking for
    whitespace-separated fields; however, if you use the -F option to
    set some other non-whitespace field delimiter, then multiple and
    leading delimiters are *not* ignored:

    $ echo ' a  b  c ' | awk '{print $2, NF}'         # prints b 3
    $ echo ':a::b::c' | awk -F: '{print $2, NF}'      # prints a 6

    Modern versions of awk can use a regular expression between fields:

    $ echo ':a::b::c' | awk -F':+' '{print $2, NF}'   # prints a 4

    Remember to quote all special characters to protect them from the shell.

*)  Awk concatenates strings if they are only separated by whitespace.
    In the "print" command, you must either add your own spaces
    between the strings or else use commas to add spaces:  (p.142)

    $ echo hi | awk '{ print $1 "no" "space" "here" }'  # hinospacehere
    $ echo hi | awk '{ print $1,"ok","space","here" }'  # hi ok space here

*)  The "pattern" in an awk script is better called an "expression".
    If the expression evaluates to "true", the "action" following the
    expression is executed.  (If there is no action, the default is to
    print the line.)  Many awk expressions contain regular expressions
    and other pattern-matching features; if the pattern matches the line
    or field in the line, that part of the expression is TRUE and the
    action may be executed.  (p.142)
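
    A small sketch of three kinds of expression used as the "pattern":
    a regular expression, a comparison, and a bare value.  The sample
    data is invented for illustration.

```shell
# Invented sample data: item name and quantity; the second line is blank.
data='apples 12

pears 0
plums 7'

# A regular expression as the pattern: TRUE when the line matches.
by_regexp=$(printf '%s\n' "$data" | awk '/^p/')

# A relational expression as the pattern: TRUE when field 2 exceeds 5.
by_compare=$(printf '%s\n' "$data" | awk '$2 > 5 { print $1 }')

# A bare value as the pattern: NF is 0 (FALSE) on the blank line.
by_value=$(printf '%s\n' "$data" | awk 'NF { print NR, $0 }')

printf '%s\n' "$by_regexp" "$by_compare" "$by_value"
```

    The last command is a compact way to skip blank lines: any
    expression that evaluates to zero or the null string is FALSE.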

*)  The awk language uses traditional C-style arithmetic, logical, and
    relational operators in both the pattern and action parts of the
    language.  (Recall that the Bourne-style shell "test" command uses
    odd-looking operators such as "-lt" and "-a", not "<" and "&&".)

    $ awk -F: 'NR <= 10 { print }' /etc/passwd
    $ awk -F: 'NR <= 10' /etc/passwd             # default is to print
    $ awk -F: 'NF != 7 { print "Bad line:", $0 }' /etc/passwd
    $ awk -F: 'NR <= 10 && $3 == $4' /etc/passwd # default is to print
    $ awk -F: '($3 + 1) < ($4 / 2)' /etc/passwd  # default is to print
    $ awk -F: '($3 + $4) < 50' /etc/passwd  # default is to print
    $ awk -F: '($3 + $4) < 50 { print $1, "sum is", $3 + $4 }' /etc/passwd

    You must protect all special characters from the shell by using quoting.

*)  Warning: If you use a non-numeric string in an arithmetic expression,
    awk will substitute the value 0 and not warn or complain!

    $ echo 100 11   | awk '{ print $1 + $2 }'     # prints 111
    $ echo hi there | awk '{ print $1 + $2 }'     # prints 0  (!)
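
    One way to guard against this, sketched under the assumption that
    the column should hold only unsigned whole numbers, is to test the
    field with a regular expression before adding it.  The sample data
    is invented for illustration.

```shell
# Invented sample data: the second column should be a whole number,
# but one line holds the string "n/a" instead.
data='mon 10
tue n/a
wed 5'

# Unchecked: "n/a" silently counts as 0, so the total looks plausible.
unchecked=$(printf '%s\n' "$data" | awk '{ sum += $2 } END { print sum }')

# Checked: add only fields made up entirely of digits; report the rest.
checked=$(printf '%s\n' "$data" | awk '
    $2 ~ /^[0-9]+$/  { sum += $2 }
    $2 !~ /^[0-9]+$/ { print "line " NR ": not a number: " $2 }
    END              { print sum }')

printf '%s\n' "$unchecked" "$checked"
```

    Both versions arrive at a total of 15, but only the checked version
    says that line 2 was skipped.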

*)  Warning: Unlike the shell "test" command, awk does not have separate
    operators to compare numbers and strings.  If the two items both
    look like numeric input, a numeric compare is used.  If either item
    looks like a string, a string compare is used.

    $ echo hi there | awk '$1 == 0 { print }'         # no output (strings)
    $ echo hi there | awk '($1+0) == 0 { print }'     # prints hi there
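
    The difference matters most when the strings are made of digits.
    This sketch compares the same two input fields first as numbers and
    then as strings; concatenating an empty string is one common way to
    force a string compare.

```shell
# Both fields look numeric, so awk compares them as numbers: 10 > 9.
numeric=$(echo 10 9 | awk '$1 > $2 { print "greater" }')

# Appending "" makes both sides strings: "10" sorts before "9".
as_string=$(echo 10 9 | awk '($1 "") > ($2 "") { print "greater" }')

# One field is not numeric, so the compare is by string: "10" < "abc".
mixed=$(echo 10 abc | awk '$1 < $2 { print "less" }')

printf '[%s] [%s] [%s]\n' "$numeric" "$as_string" "$mixed"
```

    The middle command prints nothing, because as strings "1" sorts
    before "9".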

*)  Actions consist of awk commands enclosed in curly braces and they
    follow patterns (expressions).  The action executes if the expression
    is TRUE for the current input line.  (p.143,144)

    $ awk -F: 'NF != 7 { print NR, $0 }' /etc/passwd

    The above prints the line number and line for lines that do not have
    exactly 7 fields.

*)  Awk accepts regular expressions in the pattern (expression) area. 
    Without using the "~" (tilde) operator, the regexp is matched against
    the entire line.  If it matches, the expression is TRUE.  (p.145)

    $ awk -F: '/^[a-zA-Z]+[0-9]+:/ { print NR, $0 }' /etc/passwd

    The above prints the line number and line for lines that start with a
    userid containing one or more letters followed by one or more digits.

    All versions of awk accept basic and extended regular expressions
    (without needing preceding backslashes in front of the extended
    metacharacters).  Various versions of awk also accept some "oddball"
    metacharacters.  Concentrate on knowing "basic" and "extended".
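
    A short sketch of the extended metacharacters "|", "( )", "?", and
    "+" used without backslashes.  The sample lines imitate the first
    few fields of /etc/passwd and are invented for illustration.

```shell
# Invented sample lines in /etc/passwd style (userid:password:uid:gid).
data='root:x:0:0
abc123:x:1001:1001
daemon:x:1:1
color:x:1002:1002
colour:x:1003:1003'

# Alternation with | and grouping with ( ): userid is root or daemon.
alt=$(printf '%s\n' "$data" | awk -F: '/^(root|daemon):/ { print $1 }')

# ? makes the preceding character optional: color or colour.
opt=$(printf '%s\n' "$data" | awk -F: '/^colou?r:/ { print $1 }')

# + means one or more: letters followed by digits.
plus=$(printf '%s\n' "$data" | awk -F: '/^[a-zA-Z]+[0-9]+:/ { print $1 }')

printf '%s\n' "$alt" "$opt" "$plus"
```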

*)  Introduced by awk (and since inherited by Perl, which spells it
    "=~") is the "~" (tilde) operator, used to match a regular
    expression against an item and return TRUE or FALSE.  The regexp
    *must* be on the right-hand side, and must be enclosed in slashes
    or double quotes if it is not a field or variable name:

    $ echo hi there | awk '$1 ~ /^hi/'     # prints hi there
    $ echo hi there | awk '$1 ~ /^hi /'    # no output
    $ echo hi there | awk '$1 !~ /^hi /'   # prints hi there
    $ echo anything | awk '"hi" !~ /^hi /' # prints anything

    The pair "!~" acts like the "-v" option to "grep" - if the regexp
    matches, the truth value is FALSE, otherwise TRUE.  (p.148)
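
    The examples above all write the regexp between slashes.  This
    sketch shows the other forms the right-hand side can take: a
    double-quoted string, a variable, and a field.  The variable name
    "pat" is invented for the sketch.

```shell
# Regexp written as a double-quoted string instead of between slashes.
quoted=$(echo hi there | awk '$2 ~ "^th" { print $2 }')

# Regexp held in an awk variable, set here with the -v option.
in_var=$(echo hi there | awk -v pat='e$' '$2 ~ pat { print "ends in e" }')

# Regexp taken from another field: does field 1 match field 2?
in_field=$(echo hello 'l+o$' | awk '$1 ~ $2 { print "match" }')

printf '%s\n' "$quoted" "$in_var" "$in_field"
```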

*)  Type in and practice the review sections: 5.9.1 to 5.9.4

*)  Tables 6.1, 6.2, and 6.3 are mostly a review of C Language operators;
    the only new item is the awk "~" (tilde) operator.  (p.163,166)

*)  Remember that awk performs all arithmetic in floating-point.
    (p.165)  Arithmetic and expressions can appear in either the
    pattern area or the action area, or both:

    $ awk -F: '($3 + $4) < 50 { print $1, "sum is", $3 + $4 }' /etc/passwd

    Quote the awk script to protect the characters from the shell.
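
    Floating-point arithmetic shows up most clearly in division.  In C,
    7 / 2 with integer operands gives 3; the sketch below shows what
    awk gives, and how the built-in int() function truncates a result.

```shell
# Division is not truncated the way integer division is in C.
quotient=$(echo 7 2 | awk '{ print $1 / $2 }')

# int() is a built-in awk function that truncates toward zero.
truncated=$(echo 7 2 | awk '{ print int($1 / $2) }')

# % gives the remainder, as in C.
remainder=$(echo 7 2 | awk '{ print $1 % $2 }')

printf '%s %s %s\n' "$quotient" "$truncated" "$remainder"
```

    This prints "3.5 3 1".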

*)  Like sed, awk can apply an action to a range of lines.  Unlike
    sed, awk has no bare line-number addresses; each end of an awk
    range is a pattern, most often a regular expression (though a
    test such as NR == 5 also works): (p.167)

    $ awk -F: '/^root:/,/^daemon/' /etc/passwd

    The above expression is TRUE for every line from the first line that
    matches the first regexp through the next line that matches the
    second regexp, inclusive.
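
    A sketch of two range behaviours worth knowing: a range can match
    more than once in the same input, and a range whose second pattern
    never matches runs to the end of the input.  Current awks (gawk,
    mawk) also accept a comparison such as NR == 2 as either end of the
    range.  The sample data is invented for illustration.

```shell
# Invented sample data with one complete START/END block and one
# block that is never closed.
data='one
START
two
END
three
START
four'

# Regexp range: switched on by each line matching START and off after
# the next line matching END.  The second range never sees END, so it
# continues to the last line.
by_regexp=$(printf '%s\n' "$data" | awk '/^START/,/^END/')

# Expression range: records 2 through 4, selected by record number.
by_number=$(printf '%s\n' "$data" | awk 'NR == 2, NR == 4')

printf '%s\n' "$by_regexp" "$by_number"
```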

*)  Type in and practice the review sections: 6.2.1 to 6.2.6

*)  If you use a field variable (e.g. $1) on the left side of an
    assignment statement, you replace just that field in the current
    input line.  If you subsequently print the line, the line will
    contain the changed field: (6.2.8 p.177)

    $ echo a b c | awk '{ $2 = "NEW"; print }'     # output: a NEW c

    You can also replace the field with nothing:

    $ echo a b c | awk '{ $2 = ""; print }'        # output: a  c

    You can also replace the field with the result of an expression:

    $ echo 1 2 3 | awk '{ $2 *= 5; $3 *= 9; print }' # 1 10 27 

    And you can create new fields where none existed:

    $ echo 1 2 3 | awk '{ $4 = $1+$2+$3; $5="hi"; print }' # 1 2 3 6 hi
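
    One side effect to watch for: assigning to any field makes awk
    rebuild the whole line, joining the fields with single spaces by
    default.  The original spacing, or the original separator given
    with -F, is not kept.  A minimal sketch:

```shell
# Runs of blanks collapse to single spaces once a field is assigned.
respaced=$(echo 'a    b    c' | awk '{ $2 = "NEW"; print }')

# With -F: the input is split on colons, but the rebuilt line is
# still joined with spaces.
recolon=$(echo 'a:b:c' | awk -F: '{ $2 = "NEW"; print }')

# Without an assignment the line is printed exactly as it was read.
untouched=$(echo 'a:b:c' | awk -F: '{ print }')

printf '%s\n' "$respaced" "$recolon" "$untouched"
```

    The first two commands both print "a NEW c"; the third prints
    "a:b:c".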

*)  Awk variable names, and numeric and string constants, resemble those
    of C Language.  Unlike C language, an awk variable or field may
    contain either a number or a string.  The same field may be
    treated as a number one minute and a string the next:

    $ echo 027 | awk '{ x=$1 $1 $1; y=$1+$1+$1; print x, y }'  # 027027027 81

    The variable x receives the concatenation of three copies of the
    string "027".  The variable y receives the result of adding the
    number 27 to itself three times.  Context determines whether an item
    is treated as a string or a number.

    Unlike shell variables, awk variable names are not preceded by
    dollar signs.  (Only awk field names use dollar signs.)

*)  Unlike C language, you won't get any error message from awk if you
    have what appear to be "undefined" variables.   Awk has no such thing
    as an "undefined" variable - all variables are defined to be the null
    string (or zero).  This program produces no errors and only the
    word "hi" as output:

    $ echo hi | awk '{ print Your output is $1 }'     # hi

    The unquoted tokens "Your", "output", and "is" are simply interpreted
    by awk to be variables with null values.  Three null strings
    concatenated together into one output string disappear completely.
    Here is the correct code, quoting the string constant properly:

    $ echo hi | awk '{ print "Your output is " $1 }' # Your output is hi
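
    The same rule is what lets counters and sums start without being
    initialized, and it is also why a misspelled variable name goes
    unnoticed.  A small sketch; the variable names n, count, total, and
    totl are invented for illustration.

```shell
# n is never assigned.  In arithmetic it acts as 0; in a string it
# acts as the null string.
both=$(echo hi | awk '{ print n + 1; print "[" n "]" }')

# A counter can start without being set to zero first.
lines=$(printf 'a\nb\nc\n' | awk '{ count++ } END { print count }')

# A misspelled name (totl for total) is just another empty variable.
typo=$(echo 5 | awk '{ total = $1; print "total is " totl }')

printf '%s\n' "$both" "$lines" "$typo"
```

    The last command prints "total is " with nothing after it, and awk
    gives no warning.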

*)  You can do "naked" user variable assignments on the awk command line,
    if the assignment statements appear *after* the script argument and
    *before* any file names:  (p.183)

    $ echo 33 | awk '{ x=$1; y=x+z; print y, z }' z=100  # 133 100
    $ echo hi | awk '{ print $1, foo }' 'foo=to my dog'  # hi to my dog

    The "z=100" assigns the number 100 to the variable z before awk
    reads the first line of input.  'foo=to my dog' assigns the string
    "to my dog" to the variable "foo".  (Quoting is necessary to hide
    the blanks in the assignment argument from the shell.)

    You can have several variable assignments on the command line,
    separated by blanks.  The assignment statement must not contain any
    blanks between the variable name and the equals sign.

    Newer versions of awk let you do the same kinds of assignment
    using the "-v" option (more in keeping with usual Unix syntax):

    $ echo 33 | awk -v z=100 '{ x=$1; y=x+z; print y }'    # 133
    $ echo hi | awk -v 'foo=to my dog' '{ print $1, foo }' # hi to my dog

    The -v option (like all awk options) must precede the script argument.
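
    In shell scripts, the usual reason for these assignments is to pass
    a shell variable into a single-quoted awk script.  A minimal
    sketch; the names "limit" and "max" and the sample data are
    invented for illustration.

```shell
# Invented sample data in /etc/passwd style (userid:password:uid:gid).
data='root:x:0:0
daemon:x:1:1
alice:x:1001:1001'

# limit is a shell variable; max is the awk variable that receives it.
limit=50

# Single quotes keep the shell away from $3 and $4; -v carries the
# shell value into awk.
with_v=$(printf '%s\n' "$data" |
    awk -F: -v max="$limit" '($3 + $4) < max { print $1 }')

# The same thing using an assignment argument after the script.
with_arg=$(printf '%s\n' "$data" |
    awk -F: '($3 + $4) < max { print $1 }' max="$limit")

printf '%s\n' "$with_v" "$with_arg"
```

    Both commands select root and daemon.  The shell expands "$limit"
    because it sits outside the single-quoted awk script.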

*)  NF holds the number of fields on the current line.
    NR holds the number of the current input line being processed.  (p.185)
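
    A short sketch that prints both variables for every input line and
    then uses NR in an END action to count the lines.  The sample data
    is invented for illustration.

```shell
# Invented sample data: one, two, and three words per line.
data='one
two words
three short words'

# NR is the line number, NF the field count, $NF the last field.
report=$(printf '%s\n' "$data" |
    awk '{ print NR ": " NF " fields, last is " $NF }')

# In an END action, NR still holds the number of the last line read.
total=$(printf '%s\n' "$data" | awk 'END { print NR }')

printf '%s\n' "$report" "$total"
```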

*)  Describe what the output of this command will be: (p.185)

    $ awk -F: 'NF == NR' /etc/passwd

*)  Type in and practice the review sections: 7.5.1 to 7.5.2

That's all.  Some more awk example questions will be added eventually...