-----------------------------------------------
Unix Shells by Example: Chapter 4 Reading Guide
-----------------------------------------------
-Ian! D. Allen idallen@idallen.ca

Here is a reading guide and some review questions for Chapter 4
"The GREP Family".

Remember to read the text_errata.txt file (under Notes) and correct all
the mistakes in this Chapter before you read it.

Useful additional notes to read:

    regular_expressions.txt
    regular_expression_questions.txt
    regular_expression_practice1.txt
    regular_expression_practice2.txt

The data files for the examples in the textbook are on your CDROM and are
also under this directory in the Linux Lab: /home/cst8129/chap04/

Many of the files have been corrupted to DOS CR/LF format:

    $ file chap04/* | grep CRLF
    chap04/datafile: ASCII text, with CRLF line terminators
    chap04/datebook: ASCII text, with CRLF line terminators
    chap04/db: ASCII text, with CRLF line terminators
    chap04/negative: ASCII English text, with CRLF line terminators
    chap04/repatterns: ASCII text, with CRLF line terminators

The extra CR character at the end of each line will cause many regular
expressions that try to match patterns ending in '$' to fail. You can use
the command "dos2unix" to convert these corrupted files back to Unix format:

    dos2unix /tmp/fixed.txt

Note: The information in Table 4.1 is partially duplicated in 3.1 on p.70,
4.3 on p.101 and 5.3 on p.132.

Warning: Do not confuse the meaning of metacharacters used in regular
expressions and those used in shell GLOB patterns. The same characters
are used, but they often mean different things.

Options to know:

    grep family options: -A, -B, -c, -i, -l, -n, -v, -w
    (a few of these are Linux-only, see p.114)

*) What is the syntax of the "grep" command? (p.82)

*) Are forward slashes needed in the pattern part of the grep command line?

*) Can you use two patterns as the first argument to grep?

*) What happens if you don't give grep any file names?
   (p.82)

*) What causes each of the three exit statuses to be returned from the
   grep command? (p.82,85)

*) Ignore the first part of Section 4.1.3 and read the file
   "regular_expressions.txt" (under Notes) instead. (p.83)

*) Learn to use the Basic and Extended regular expression characters
   listed in the file "regular_expressions.txt". You will need to know
   how to use all the Basic and Extended metacharacters in this file.
   List the Basic regular expression characters and their meanings.
   List the Extended regular expression characters and their meanings.

*) Study well all the examples in this chapter. Try them! The location
   of the data files for the examples in the textbook is given above.

*) Know the basic regexp characters: ^ $ . * []
   Know the extended characters:     ? + | () {}
   Don't try to memorize which versions of which commands do/don't handle
   the "oddball" regular expression metacharacters and back-references.

*) How does "fgrep" differ from both "grep" and "egrep"? (p.99)

*) True or False: because fgrep does not recognize any regular expression
   metacharacters, no quoting of metacharacters is necessary on the fgrep
   command line, e.g.

       $ fgrep *best* file

*) POSIX named character classes are not supported by all programs that
   handle regular expressions. Experiment before you use them. Using
   these classes will make your programs more portable. (p.103)

*) Why is the POSIX character class [:alnum:] not identical to the
   character range A-Za-z0-9 ? (p.103)

*) For North American ASCII, what is the one character difference between
   the POSIX character class [:alnum:] and the VI or Gnu Grep character
   class \w ? (p.106)

*) Know the meaning of these options to the grep family (from Table 4.11
   on p.114): -A -B -c -i -l -n -v -w

*) Do the exercise on p.124. The data files for the examples in the
   textbook are on the CDROM and are also under the directory mentioned
   at the top of this file.
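
The CR/LF warning and the exit-status question above can both be explored
with a small experiment. The sketch below is only an illustration: the
scratch files and their contents are made up here and are not part of the
textbook data. It assumes a typical Linux grep; POSIX only promises that
the error status is greater than 1.

```shell
#!/bin/sh
# Hypothetical scratch files; nothing here comes from the textbook data.
tmpdir=$(mktemp -d)
printf 'west\nnorth\n'     > "$tmpdir/unix.txt"   # Unix LF line endings
printf 'west\r\nnorth\r\n' > "$tmpdir/dos.txt"    # DOS CR/LF line endings

# 'west$' matches when the line really ends right after "west".
unix_status=0
grep -c 'west$' "$tmpdir/unix.txt" || unix_status=$?

# In the DOS file a CR sits between "west" and the end of the line,
# so the same pattern selects nothing.
dos_status=0
grep -c 'west$' "$tmpdir/dos.txt" || dos_status=$?

# A file that cannot be opened is an error, not just "no match".
err_status=0
grep 'west' "$tmpdir/no_such_file" 2>/dev/null || err_status=$?

echo "unix=$unix_status dos=$dos_status error=$err_status"
rm -r "$tmpdir"
```

The two "grep -c" commands print the counts 1 and 0. The three saved
statuses show a match, no match, and an error, in that order.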

--------------------------------------
More questions on Regular Expressions:
--------------------------------------

*) In the expression "abc*", does the "*" repeat the entire word "abc"
   zero or more times, or does it only repeat the letter "c" zero or
   more times?

*) In the extended regular expression "(abc)+", does the "+" repeat the
   closing parenthesis one or more times, or does it repeat the entire
   parenthesized expression one or more times (e.g. abcabcabc)?

*) How do these (extended) regular expressions differ?

       $ egrep -e '(b|B)(e|E)(e|E)(r|R)' file
       $ egrep -e '[bB][eE][eE][rR]' file
       $ egrep -i -e 'beer' file

   Which is easier to understand?

   Do the following expressions match exactly the same lines as the
   above expressions?

       $ egrep -e 'beer|BEER' file
       $ egrep -e '[beer][BEER]' file
       $ egrep -e '[beer]|[BEER]' file

*) Are the following extended regular expression lines exactly equivalent?

       $ egrep -e 'a(b|c)d' file
       $ egrep -e '(ab|ac)d' file
       $ egrep -e 'a(bd|cd)' file
       $ egrep -e 'abd|acd' file

   Hint: Yes. Concatenation and alternation of regular expressions obey
   rules similar to multiplication and addition of numbers in arithmetic:

       ARITHMETIC: a*(b+c)*d = (a*b+a*c)*d = a*b*d+a*c*d
       REGEXP:     a(b|c)d   = (ab|ac)d    = abd|acd

   Think of concatenation as "multiply" and alternation as "add" to get
   the precedence rules correct.

*) Are the following lines exactly equivalent?

       $ egrep 'labell?ed' file
       $ egrep 'label(l|)ed' file

   Can the "?" metacharacter always be replaced by a parenthesized
   expression using "|" with one empty alternative?

   Hint: Yes. You never need to use "?" in an extended regular
   expression - it just makes some extended regular expressions shorter.

*) Are the following lines exactly equivalent?

       $ egrep '0+' file
       $ egrep '00*' file

   Can the "+" metacharacter always be replaced by repeating the pattern
   and using "*" instead?

   Hint: Yes. You never need to use "+" - it just makes some extended
   regular expressions shorter (sometimes a *lot* shorter!).
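
The equivalence hints above can be checked by experiment. The sketch
below is only an illustration: the helper name "same_lines" and the
sample words are invented here, and "grep -E" is used as the modern
spelling of "egrep". Agreement on one small sample does not prove two
expressions are equivalent, but a disagreement would disprove it. Note
that an empty alternative such as "(l|)" is accepted by GNU grep but its
behaviour is left undefined by POSIX.

```shell
#!/bin/sh
# Hypothetical sample lines chosen to exercise the patterns above.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' abd acd ad abcd labeled labelled labelllled 0 100 x > "$sample"

# same_lines PAT1 PAT2
# Succeed if both extended regular expressions select the same lines
# from the sample file.
same_lines () {
    a=$(grep -E -e "$1" "$sample")
    b=$(grep -E -e "$2" "$sample")
    [ "$a" = "$b" ]
}

same_lines 'a(b|c)d'   'abd|acd'     && echo "alternation distributes"
same_lines '(ab|ac)d'  'a(bd|cd)'    && echo "both factorings agree"
same_lines 'labell?ed' 'label(l|)ed' && echo "? acts as an empty alternative"
same_lines '0+'        '00*'         && echo "+ is one copy followed by *"
```

Adding more lines to the sample file is a quick way to hunt for a
counter-example when you suspect two expressions are not equivalent.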

*) Are the following lines exactly equivalent?

       $ egrep 'a*b*c*' file
       $ egrep '[abc]*' file
       $ egrep '(abc)*' file

   Hint: No, these are three different regular expressions. Note that
   each one can match the empty string, so without anchors each command
   selects every line. Anchor each pattern with ^ and $, then give a
   line that is matched by one but not the other.

*) The following regular expressions give identical results when used
   by grep to select lines:

       $ grep '^a' /etc/passwd
       $ grep '^a.*' /etc/passwd
       $ grep '^a.*$' /etc/passwd

   Why do they give the same results? Which one is fastest?
   Don't write complex regular expressions when simple ones will do.
   (Note that if the above patterns were used in a "sed" substitution,
   the patterns would match different things.)

*) The following regular expressions give identical results when used
   by grep to select lines:

       $ grep 'a$' /etc/passwd
       $ grep '.*a$' /etc/passwd
       $ grep '^.*a$' /etc/passwd

   Why do they give the same results? Which one is fastest?
   Don't write complex regular expressions when simple ones will do.
   (Note that if the above patterns were used in a "sed" substitution,
   the patterns would match different things.)

*) Look for lines in the password file that contain four or more
   adjacent zeroes. Use an option to display just the count of lines,
   not the lines themselves. (Do not use "wc"; use an option to "grep".)

*) Use an option to display just the file names of the header files in
   the /usr/include/ directory that contain the string "stdin".
   (Header files end in the two characters ".h".) Don't display the
   matching lines, just the names of the files containing a match.
   (Answer: about 13 files, including /usr/include/stdio.h .)

*) Repeat the above question, but use an option to grep that will do a
   case-insensitive match that will find "stdin", "STDIN", "sTdIn", etc.
   How does the list of files output differ from the previous question?
   (Hint: put both lists of files into temporary files and run "diff"
   to see the differences.)

*) Use an option to display the count of words in /usr/share/dict/words
   that both begin and end with the lower-case letter 'a'.
   (Answer: 1433 words)

*) Use an option to display the count of words in /usr/share/dict/words
   that both begin and end with the lower-case letter 'a' and also
   contain a third letter 'a' somewhere in the middle.
   (Answer: 595 words.)

*) Repeat the above question, but add an option to do a case-insensitive
   match. (Answer: 1126 words.)

*) Use options to display the count of words in /usr/share/dict/words
   that both begin and end with the letter 'a' and also contain a third
   and a fourth letter 'a' somewhere in the middle.
   Do a case-sensitive match. (Answer: 100 words.)
   Do a case-insensitive match. (Answer: 191 words.)

*) Use grep to select words from the file /usr/share/dict/words that
   have all the vowels in ascending order, "a" before "e" before "i"
   before "o" before "u", with any number of other characters in between.
   (Answer: 247 or 250 words depending on case sensitivity.)

*) Use grep to select words from the file /usr/share/dict/words that
   have all the individual letters in the name "elvis" in the same order,
   "e" before "l" before "v" before "i" before "s", with any number of
   other characters in between the letters.
   (Answer: 134 or 135 words. The longest one is
   "pneumonoultramicroscopicsilicovolcanoconiosis".)

*) Find which header files in the /usr/include/ directory contain the
   string "FILE". (Header files end in the two characters ".h".)
   Don't display the matching lines, just the names of the files
   containing a match.
   (Answer: about 58 files, including /usr/include/stdio.h .)

*) Repeat the above question, but use an option to grep to match only
   the *word* "FILE", not the string FILE. (Answer: about 30 files.)

*) Repeat the above question, but match the word "printf".
   (Answer: about 9 files, including /usr/include/error.h .)

*) These mean different things:

       1. Display lines that contain a character that is not the letter 'a'
       2. Display lines that do not contain the letter 'a'

   Give an example of a line that one matches but the other does not.
   How long is the shortest line output by each command?

*) Do these command lines always give the same output?

       1. grep '[^a]'
       2. grep -v 'a'

   If they differ, give an example of a line that one matches but the
   other does not. How long is the shortest line output by each command?

*) Do these command lines always give the same output?

       1. grep '[^d][^o][^g]'
       2. grep -v 'dog'

   If they differ, give an example of a line that one matches but the
   other does not. How long is the shortest line output by each command?

*) How many lines in /usr/include/stdio.h do *not* contain any characters?
   (Note: A line with "no characters" still ends in a newline!)
   You can answer this two ways:

       1. How many lines have the end of the line right after the start?
       2. If you exclude all lines that contain any single character,
          how many lines are left over (count the non-matching lines)?

   Derive grep expressions to produce both answers. One expression will
   probably use an option to grep to "invert" the match and select only
   non-matching lines. (Answer: 181 lines)

*) How many lines in /usr/include/stdio.h do *not* contain any blanks?
   You can answer this two ways:

       1. How many lines contain only zero or more non-blank characters?
       2. If you exclude all lines that contain a blank character,
          how many lines are left over (count the non-matching lines)?

   Derive grep expressions to produce both answers. One expression will
   probably use an option to grep to "invert" the match and select only
   non-matching lines. (Answer: 272 lines)

*) How many lines in /usr/include/stdio.h do *not* contain any upper- or
   lower-case letters? You can answer this two ways:

       1. How many lines contain only zero or more non-letter characters?
       2. If you exclude all lines that contain a letter,
          how many lines are left over (count the non-matching lines)?

   Derive grep expressions to produce both answers. One expression will
   probably use an option to grep to "invert" the match and select only
   non-matching lines.
   (Answer: 183) (Time-saver: use a case-insensitive match.)

*) The directory /usr/include/ is where the C language keeps its standard
   header files on Unix, e.g. #include <stdio.h> refers to the file
   "/usr/include/stdio.h" and #include <sys/cdefs.h> refers to the file
   /usr/include/sys/cdefs.h.

   The file errno.h in the /usr/include directory contains the #define
   statements for Unix errors. Find the #define statement that defines
   the Unix "EPERM" error ("Operation not permitted").

   Problem: Unfortunately, include files often contain other #include
   directives that include other files (that themselves contain #include
   directives of other files...), so you often can't find what you want
   by doing:

       grep -w EPERM /usr/include/errno.h    # no results!

   File errno.h includes other include files, and one of those other
   include files must contain the actual EPERM definition.

   Solution: Use a command and regexp to look for *both* EPERM *and*
   "include" lines (at the same time) in /usr/include/errno.h, then
   repeat and look for both strings in any #include files found. If you
   don't find the definition there, keep repeating the process on all
   the #include file names in those included files, until you finally
   find the actual file containing the EPERM definition.
   (Manually follow the chain of #include directives.)

   What actual file contains the definition of EPERM?
   What is the value of EPERM?

   Use a grep command line to count how many #define statements are in
   this file (about 125).

   Modify the grep expression to count *only* the define statements that
   define numeric error numbers. (Count only lines that have #define
   followed by any number of any character followed by a number preceded
   by a whitespace character [blank or tab]. You can use the POSIX
   bracketed [:space:] and [:digit:] character classes here.)
   (Answer: 122 lines)

*) Write a small script to display just the line number of the first
   line on which a pattern is found in a file.
   Use this syntax: $0 pattern filename

   Examples:

       $ ./myline 'struct' /usr/include/stdio.h
       45
       $ ./myline 'errlist' /usr/include/stdio.h
       554

   Hints: Use grep to find the pattern in the file and use a grep option
   to output the line number along with the line. Use a common Unix
   command to select just the *first* line of grep output. Split the
   line number off from the beginning of this line and display just the
   number. (See the data_mining.txt file under Notes for techniques of
   splitting lines to get at fields. [Hint hint: use awk with the '-F:'
   option!])

   Use pipes to connect all your commands - do not save output in
   temporary files! Your final script will probably contain three Unix
   commands in the pipeline, starting with grep.

   Validate your inputs before you use them in the script. (Check for
   missing arguments; make sure the filename is readable, etc.)
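
Try the exercise yourself before reading further. For comparison, here
is one possible shape of the three-command pipeline that the hints
describe. It is a sketch, not the official answer: it is written as a
shell function named "myline" so it can be tried interactively, the
error messages and the return status of 2 are choices made here, and a
real script would use $0 in its usage message and "exit" in place of
"return".

```shell
#!/bin/sh
# myline PATTERN FILENAME
# Print the line number of the first line in FILENAME that matches PATTERN.
myline () {
    # Validate the inputs before using them.
    if [ $# -ne 2 ]; then
        echo "Usage: myline pattern filename" 1>&2
        return 2
    fi
    if [ ! -f "$2" ] || [ ! -r "$2" ]; then
        echo "myline: cannot read file '$2'" 1>&2
        return 2
    fi
    # grep -n prefixes each matching line with "number:",
    # head keeps only the first match,
    # awk splits on ":" and prints the first field (the line number).
    grep -n -e "$1" "$2" | head -n 1 | awk -F: '{ print $1 }'
}
```

Usage would look like: myline 'struct' /usr/include/stdio.h
If the pattern is not found, this sketch prints nothing.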