------------------------------------------------
Linux Shells by Example: Chapters 5-7 Reading Guide
------------------------------------------------
-IAN! idallen@ncf.ca

Here is a reading guide and some review questions for Chapters 5-7
    "gawk Utility: gawk as a Linux Tool"
    "the gawk Utility: Evaluating Expressions"
    "the gawk Utility: gawk Programming"

Remember to read the text_errata.txt file (under Notes) and correct all
the mistakes in these Chapters before you read them.

The data files for the examples in the textbook are under these directories:
    /home/alleni/cst/cdrom/chap05/
    /home/alleni/cst/cdrom/chap06/
    /home/alleni/cst/cdrom/chap07/

Useful additional notes to read:
    regular_expressions.txt (Basic vs. Extended Regular Expressions)

Introduction to awk:

Before the appearance of the Perl language, awk and its successors
(nawk and gawk) were useful programs for processing files of data and
manipulating the data.  Perl (which took the best parts of awk and
added pieces of sed, tr, and the C Library) has largely replaced awk
for serious file processing.  Perl has been ported to many different
architectures (including Macintosh and Windows), which now makes it
more common than awk.

In this course we will only touch on the very basics of how awk works,
so that you can use awk for simple "data mining" in shell scripts.
We will not be studying the full programming language features of
awk/gawk.

What is awk good for?

We use awk most commonly for simple extractions and manipulations of
tabular data, things that awk can do more concisely than Perl.

Example: Process the password file and print the line number and userid
of all lines where the numeric uid plus gid is less than 50.

    $ awk -F: '($3 + $4) < 50 { print NR, $1 }' /etc/passwd

Awk does this in one line.  Perl also does it in one line:

    $ perl -F: -ane 'print "$. $F[0]\n" if ($F[2] + $F[3]) < 50' /etc/passwd

The Perl line is longer, harder to type, and has to have all the array
indexes shifted down by 1.
(Awk labels fields starting with $1, the way people count; Perl numbers
its arrays starting at zero.)

Example: Sum the second column of all input lines and print the total and
average value (handle a file with zero records correctly):

    $ awk '
        { sum += $2; ++count; }   # this line applies to all input lines
        END {
            print "total", sum
            if ( count == 0 ) count = 1
            print "average", sum/count
        }' inputfile

While Perl can also do this (and much, much more), the Perl code to do
it is somewhat more verbose (and the array index looks wrong, but isn't):

    $ perl -ane '
        $sum += $F[1]; ++$count;
        END {
            print "total $sum\n";
            $count = 1 if $count == 0;
            print "average ", $sum / $count, "\n";
        }' inputfile

Perl takes just a bit more "syntax" to do the same thing.  The economy
of expression in awk is what makes awk popular for simple data
manipulation in shell scripts.

*) What to study in Chapter 5,6,7:

    5.2 intro   YES
    5.2.1       YES
    5.2.2       YES
    5.2.3       only the -F option (set input field separator character)
    5.3.1       YES
    5.3.2       NO
    5.3.3       NO
    5.3.3       NO
    5.4         NO
    5.5.1       YES (know the use of $0 and NR inside awk)
    5.5.2       YES
    5.5.3       only single-character input field separator (FS) (no regexp)
    5.6.1       YES
    5.6.2       YES
    5.7 intro   YES (basic and extended regexp characters only)
    5.7.1       YES
    5.8         NO
    5.9.1       YES
    5.9.2       YES
    5.9.3       YES
    5.9.4       YES (not Example 5.55)
    5.9.5       NO
    Linux Tools Lab 3 - questions 1 through 9

    6.1 intro   YES
    6.1.1       YES (most are same as C language)
    6.1.2       NO
    6.1.3       YES (most are same as C language)
    6.1.4       YES (same as C language)
    6.1.5       YES (same as sed)
    6.1.6       NO
    6.2.1       YES
    6.2.2       YES
    6.2.3       YES
    6.2.4       YES
    6.2.5       YES
    6.2.6       YES
    6.2.7       NO
    6.2.8       YES
    Linux Tools Lab 4 - all questions

    7.1.1       YES
    7.1.2       YES (only know these built in variables: NF, NR) p.184
    7.1.3       NO
    7.1.4       NO
    7.2         NO
    7.3         NO
    7.4         NO
    7.5.1       YES
    7.5.2       YES (only variables NR and NF; using option "-F:")
    7.5.3       NO
    7.5.4       NO
    7.5.5       NO
    7.5.6       NO
    7.5.7       NO
    7.5.8       NO
    7.6         NO
    7.7         NO
    7.8         NO
    7.9         NO
    7.10        NO
    7.11        NO
    7.12        NO
    7.13        NO
    7.14        NO
    Linux Tools Lab 5,6,7 - no questions apply

*) An awk
   program is a list of pattern and action pairs (p.127).
   What is the default action, if no action is given after a pattern?
   What is the default pattern, if no pattern is given before an action?
   What happens if both pattern and action are missing?

*) If you do not give awk any file names to process, what happens? (p.129)
   (Hint: the same thing happens with head, tail, cat, sort, etc.)

*) All versions of awk can use the -F option to set a different input
   field separator.  (The default separator is whitespace.)  Only recent
   versions of awk accept multiple characters or a regular expression
   as the separator - be careful.

*) Use the -F option to print only the userid fields from the Unix
   password file.  (see section 5.5.3 for similar examples)

*) True or False: The following awk lines produce identical output:

   $ awk -F: 'NR <= 10 { print "Userid", $1, "Shell", $7 }' /etc/passwd
   $ awk -F: 'NR <= 10 { print "Userid " $1 " Shell " $7 }' /etc/passwd

   True - the "," between print fields puts a space between them in the
   output.  Without commas, strings and fields separated by blanks are
   concatenated together into one big string; if you want blanks, you
   must add them yourself.  (p.132)

*) By default, awk recognizes records in a file as being separated by
   newline characters.  What built-in awk variable counts the number of
   the current input record (current line)?  (p.139)

*) True or False: The variable $0 stands for the current program name
   when used inside an awk program.  (p.138,139)

*) By default, awk recognizes fields in a file as being separated by
   any amount of whitespace (blanks or tabs).  What built-in awk variable
   counts the number of fields in the current line?  (p.139)

*) What command-line option tells awk to use some other character to
   separate fields in input lines?  (p.140,141)

*) What is the output of these command lines?

   $ awk -F: 'NR <= 10 { print "Userid", $1, "Shell", $NF }' /etc/passwd
   $ awk -F: 'NR <= 10 { print "Userid", $1, "Shell", $(NF-1) }' /etc/passwd
   $ awk -F: 'NR <= 10 { print "Userid", $1, "Shell", $(NF-2) }' /etc/passwd

   (Note: The syntax $NF only works in awk - don't try this to refer to
   positional parameters inside a shell script; it won't work!)

*) By default, awk discards leading whitespace when looking for
   whitespace-separated fields; however, if you use the -F option to set
   some other non-whitespace field delimiter, then multiple and leading
   delimiters are *not* ignored:

   $ echo ' a b c ' | awk '{print $2, NF}'          # prints b 3
   $ echo ':a::b::c"' | awk -F: '{print $2, NF}'    # prints a 6

   Modern versions of awk can use a regular expression between fields:

   $ echo ':a::b::c"' | awk -F':+' '{print $2, NF}' # prints a 4

   Remember to quote all special characters to protect them from the shell.

*) Awk concatenates strings if they are only separated by whitespace.
   In the "print" command, you must either add your own spaces between
   the strings or else use commas to add spaces: (p.142)

   $ echo hi | awk '{ print $1 "no" "space" "here" }'   # hinospacehere
   $ echo hi | awk '{ print $1,"ok","space","here" }'   # hi ok space here

*) The "pattern" in an awk script is better called an "expression".
   If the expression evaluates to "true", the "action" following the
   expression is executed.  (If there is no action, the default is to
   print the line.)  Many awk expressions contain regular expressions
   and other pattern-matching features; if the pattern matches the line
   or field in the line, that part of the expression is TRUE and the
   action may be executed.  (p.142)

*) The awk language uses traditional C-style arithmetic, logical, and
   relational operators in both the pattern and action parts of the
   language.  (Recall that the Bourne-style shell "test" command uses
   odd-looking operators such as "-lt" and "-a", not "<" and "&&".)

   $ awk -F: 'NR <= 10 { print }' /etc/passwd
   $ awk -F: 'NR <= 10' /etc/passwd                 # default is to print
   $ awk -F: 'NF != 7 { print "Bad line:", $0 }' /etc/passwd
   $ awk -F: 'NR <= 10 && $3 == $4' /etc/passwd     # default is to print
   $ awk -F: '($3 + 1) < ($4 / 2)' /etc/passwd      # default is to print
   $ awk -F: '($3 + $4) < 50' /etc/passwd           # default is to print
   $ awk -F: '($3 + $4) < 50 { print $1, "sum is", $3 + $4 }' /etc/passwd

   Note that $0 must be outside the double quotes: awk does not expand
   field names inside string constants.
   You must protect all special characters from the shell by using quoting.

*) Warning: If you use a non-numeric string in an arithmetic expression,
   awk will substitute the value 0 and not warn or complain!

   $ echo 100 11 | awk '{ print $1 + $2 }'          # prints 111
   $ echo hi there | awk '{ print $1 + $2 }'        # prints 0 (!)

*) Warning: Unlike the shell "test" command, awk does not have separate
   operators to compare numbers and strings.  If the two items both
   look like numeric input, a numeric compare is used.  If either item
   looks like a string, a string compare is used.

   $ echo hi there | awk '$1 == 0 { print }'        # no output (strings)
   $ echo hi there | awk '($1+0) == 0 { print }'    # prints hi there

*) Actions consist of awk commands enclosed in curly braces and they
   follow patterns (expressions).  The action executes if the expression
   is TRUE for the current input line.  (p.143,144)

   $ awk -F: 'NF != 7 { print NR, $0 }' /etc/passwd

   The above prints the line number and line for lines that do not have
   exactly 7 fields.

*) Awk accepts regular expressions in the pattern (expression) area.
   Without using the "~" (tilde) operator, the regexp is matched against
   the entire line.  If it matches, the expression is TRUE.  (p.145)

   $ awk -F: '/^[a-zA-Z]+[0-9]+:/ { print NR, $0 }' /etc/passwd

   The above prints the line number and line for lines that start with
   a userid containing one or more letters followed by one or more digits.

   All versions of awk accept basic and extended regular expressions
   (without needing preceding backslashes in front of the extended
   metacharacters).
   Various versions of awk also accept some "oddball" metacharacters.
   Concentrate on knowing "basic" and "extended".

*) Introduced by awk (but since inherited by Perl) is the "~" (tilde)
   operator used to match a regular expression against an item and
   return TRUE or FALSE.  The regexp *must* be on the right hand side,
   and must be enclosed in slashes or double quotes if it is not a field
   or variable name:

   $ echo hi there | awk '$1 ~ /^hi/'               # prints hi there
   $ echo hi there | awk '$1 ~ /^hi /'              # no output
   $ echo hi there | awk '$1 !~ /^hi /'             # prints hi there
   $ echo anything | awk '"hi" !~ /^hi /'           # prints anything

   The pair "!~" acts like the "-v" option to "grep" - if the regexp
   matches, the truth value is FALSE, otherwise TRUE.  (p.148)

*) Type in and practice the review sections: 5.9.1 to 5.9.4

*) Tables 6.1, 6.2, and 6.3 are mostly a review of C Language operators;
   the only new item is the awk "~" (tilde) operator.  (p.163,166)

*) Remember that awk performs all arithmetic in floating-point.  (p.165)
   Arithmetic and expressions can appear in either the pattern area or
   the action area, or both:

   $ awk -F: '($3 + $4) < 50 { print $1, "sum is", $3 + $4 }' /etc/passwd

   Quote the awk script to protect the characters from the shell.

*) Like sed, awk can apply an action to a range of lines.  Unlike sed,
   which can address the range of lines using bare line numbers or
   regular expressions, awk has no bare line-number addresses: a range
   is a pair of patterns separated by a comma, usually two regular
   expressions.  (Any awk expression, such as NR == 5, also works as
   an end of the range.)  (p.167)

   $ awk -F: '/^root:/,/^daemon/' /etc/passwd

   The above expression is TRUE for all lines from the first line that
   matches the first regexp, up to and including the next line that
   matches the second regexp.

*) Type in and practice the review sections: 6.2.1 to 6.2.6

*) If you use a field variable (e.g. $1) on the left side of an
   assignment statement, you replace just that field in the current
   input line.
   If you subsequently print the line, the line will contain the changed
   field: (6.2.8 p.177)

   $ echo a b c | awk '{ $2 = "NEW"; print }'       # output: a NEW c

   You can also replace the field with nothing (the empty field stays,
   so two blanks appear between its neighbours):

   $ echo a b c | awk '{ $2 = ""; print }'          # output: a  c

   You can also replace the field with the result of an expression:

   $ echo 1 2 3 | awk '{ $2 *= 5; $3 *= 9; print }' # 1 10 27

   And you can create new fields where none existed:

   $ echo 1 2 3 | awk '{ $4 = $1+$2+$3; $5="hi"; print }'  # 1 2 3 6 hi

*) Awk variable names, and numeric and string constants, resemble those
   of C Language.  Unlike C language, an awk variable or field may
   contain either a number or a string.  The same field may be treated
   as a number one minute and a string the next:

   $ echo 027 | awk '{ x=$1 $1 $1; y=$1+$1+$1; print x, y }'  # 027027027 81

   The variable x receives the concatenation of three copies of the
   string "027".  The variable y receives the result of adding the
   number 27 to itself three times.  Context determines whether an item
   is treated as a string or a number.

   Unlike shell variables, awk variable names are not preceded by dollar
   signs.  (Only awk field names use dollar signs.)

*) Unlike C language, you won't get any error message from awk if you
   have what appear to be "undefined" variables.  Awk has no such thing
   as an "undefined" variable - all variables are defined to be the null
   string (or zero).  This program produces no errors and only the word
   "hi" as output:

   $ echo hi | awk '{ print Your output is $1 }'    # hi

   The unquoted tokens "Your", "output", and "is" are simply interpreted
   by awk to be variables with null values.  Three null strings
   concatenated together into one output string disappear completely.
   Here is the correct code, quoting the string constant properly:

   $ echo hi | awk '{ print "Your output is " $1 }' # Your output is hi

*) You can do "naked" user variable assignments on the awk command line,
   if the assignment statements appear *after* the script argument and
   *before* any file names: (p.183)

   $ echo 33 | awk '{ x=$1; y=x+z; print y, z }' z=100       # 133 100
   $ echo hi | awk '{ print $1, foo }' 'foo=to my dog'       # hi to my dog

   The "z=100" assigns the number 100 to the variable z before the awk
   script starts reading input.  'foo=to my dog' assigns the string
   "to my dog" to the variable "foo".  (Quoting is necessary to hide the
   blanks in the assignment argument from the shell.)

   You can have several variable assignments on the command line,
   separated by blanks.  The assignment statement must not contain any
   blanks between the variable name and the equals sign.

   Newer versions of awk let you do the same kinds of assignment using
   the "-v" option (more in keeping with usual Unix syntax):

   $ echo 33 | awk -v z=100 '{ x=$1; y=x+z; print y }'       # 133
   $ echo hi | awk -v 'foo=to my dog' '{ print $1, foo }'    # hi to my dog

   The -v option (like all the awk options) must precede the script
   argument.

*) NF holds the number of fields on the current line.
   NR holds the number of the current input line being processed.  (p.185)

*) Describe what the output of this command will be: (p.185)

   $ awk -F: 'NF == NR' /etc/passwd

*) Type in and practice the review sections: 7.5.1 to 7.5.2

That's all.  Some more awk example questions will be added eventually...
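
Recap sketch: the short script below pulls together three items from
this guide: the -F option, a numeric comparison in the pattern area, and
the two ways of setting an awk variable from the command line.  It is
only an illustration, not an example from the textbook.  The sample
file contents and the variable name "limit" are made up.

```shell
# Hypothetical sample data: "userid:number" lines (not a real password file).
sample=$(mktemp)
printf '%s\n' 'root:0' 'daemon:1' 'alice:1000' 'bob:49' > "$sample"

# Form 1: "naked" assignment after the script and before the file name.
a=$(awk -F: '$2 < limit { print $1 }' limit=50 "$sample")

# Form 2: the -v option, placed before the script argument.
b=$(awk -F: -v limit=50 '$2 < limit { print $1 }' "$sample")

echo "$a"    # prints root, daemon, bob (one per line)
echo "$b"    # same three lines

rm -f "$sample"
```

Both forms should select the same lines, because field 2 and "limit"
both look like numbers and so awk compares them numerically.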