-----------------------------------------------
Unix Shells by Example: Chapter 4 Reading Guide
-----------------------------------------------
-Ian! D. Allen idallen@idallen.ca

Here is a reading guide and some review questions for Chapter 4
"The GREP Family".

Remember to read the text_errata.txt file (under Notes) and correct all
the mistakes in this Chapter before you read it.

Useful additional notes to read:
    regular_expressions.txt
    regular_expression_questions.txt
    regular_expression_practice1.txt
    regular_expression_practice2.txt

The data files for the examples in the textbook are on your CDROM and
are also under this directory in the Linux Lab: /home/cst8129/chap04/

    Many of the files have been corrupted to DOS CR/LF format:

    $ file chap04/* | grep CRLF
    chap04/datafile:   ASCII text, with CRLF line terminators
    chap04/datebook:   ASCII text, with CRLF line terminators
    chap04/db:         ASCII text, with CRLF line terminators
    chap04/negative:   ASCII English text, with CRLF line terminators
    chap04/repatterns: ASCII text, with CRLF line terminators

    The extra CR character at the end of each line will make many regular
    expressions that try to match patterns ending in '$' fail, because
    the CR sits between the text and the end of the line.  You can use
    the command "dos2unix" to convert these corrupted files back to Unix
    format:  dos2unix <chap04/datafile >/tmp/fixed.txt
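
    The effect of the extra CR can be seen on a small sample.  This is
    only a sketch: the sample text and the temporary file are made up for
    the demonstration, and "tr -d" stands in for dos2unix in case that
    command is not installed.

```shell
# Build a two-line sample with DOS (CR/LF) line endings.
sample=$(mktemp)
printf 'Steve Blenheim\r\nBetty Boop\r\n' >"$sample"

# The CR sits between the "m" and the end of the line, so "m$" cannot match.
# (grep exits non-zero when nothing matches; "|| true" ignores that status.)
before=$(grep -c 'Blenheim$' "$sample" || true)

# Delete every CR with tr, then try the same pattern again.
after=$(tr -d '\r' <"$sample" | grep -c 'Blenheim$')

echo "matches with CR/LF endings: $before"
echo "matches after removing CRs: $after"
rm -f "$sample"
```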

Note: The information in Table 4.1 is partially duplicated in
      Tables 3.1 on p.70, 4.3 on p.101 and 5.3 on p.132.

Warning: Do not confuse the meaning of metacharacters used in regular
         expressions with those used in shell GLOB patterns.  The same
         characters are used, but they often mean different things.
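
A small experiment may help keep the two meanings apart.  This is a
sketch only: the scratch directory, the file names and the sample lines
are invented for the demonstration.

```shell
# Create three empty files in a scratch directory.
dir=$(mktemp -d)
touch "$dir/ab" "$dir/abc" "$dir/abbb.txt"

# GLOB: the shell expands ab* to every file name that starts with "ab".
glob_names=$(cd "$dir" && echo ab*)

# REGEXP: ab* means "a" followed by zero or more "b", so a lone "a" matches.
# The pattern is quoted so that the shell does not expand it as a GLOB.
regexp_count=$(printf 'a\nab\nxyz\n' | grep -c 'ab*')

echo "GLOB ab* expands to: $glob_names"
echo "lines matched by regexp ab*: $regexp_count"
rm -rf "$dir"
```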

Options to know:  grep family options:  -A -B -c -i -l -n -v -w
(a few of these are Linux-only, see p.114)

*)  What is the syntax of the "grep" command? (p.82)

*)  Are forward slashes needed in the pattern part of the grep command line?

*)  Can you use two patterns as the first argument to grep?

*)  What happens if you don't give grep any file names? (p.82)

*)  What causes each of the three exit statuses to be returned from the
    grep command?  (p.82,85)

*)  Ignore the first part of Section 4.1.3 and read the file
    "regular_expressions.txt" (under Notes) instead.  (p.83)

*)  Learn to use the Basic and Extended regular expression characters
    listed in the file "regular_expressions.txt".  You will need to know
    how to use all the Basic and Extended metacharacters in this file.

    List the Basic regular expression characters and their meanings.

    List the Extended regular expression characters and their meanings.

*)  Study well all the examples in this chapter.  Try them!  The location
    of the data files for the examples in the textbook is given above.

*)  Know the basic regexp characters:  ^ $ . * []
    Know the extended characters:  ? + | () {}
    Don't try to memorize which versions of which commands do/don't handle
    the "oddball" regular expression metacharacters and back-references.

*)  How does "fgrep" differ from both "grep" and "egrep"?  (p.99)

*)  True or False: because fgrep does not recognize any regular
    expression metacharacters, no quoting of metacharacters is necessary
    on the fgrep command line, e.g.  $ fgrep *best* file

*)  POSIX named character classes are not supported by all programs that
    handle regular expressions.  Experiment before you use them.
    Using these classes will make your programs more portable.  (p.103)

*)  Why is the POSIX character class [:alnum:] not identical to the
    character range A-Za-z0-9 ?  (p.103)

*)  For North American ASCII, what is the one character difference
    between the POSIX character class [:alnum:] and the VI or GNU grep
    character class \w ?  (p.106)

*)  Know the meaning of these options to the grep family (from Table 4.11
    on p.114):  -A -B -c -i -l -n -v -w

*)  Do the exercise on p.124.

The data files for the examples in the textbook are on the CDROM and are
also under the directory mentioned at the top of this file.

--------------------------------------
More questions on Regular Expressions:
--------------------------------------

*)  In the expression "abc*", does the "*" repeat the entire word
    "abc" zero or more times, or does it only repeat the letter "c" zero
    or more times?

*)  In the extended regular expression "(abc)+", does the "+" repeat the
    closing parenthesis one or more times, or does it repeat the entire
    parenthesized expression one or more times (e.g. abcabcabc)?

*)  How do these (extended) regular expressions differ?

      $ egrep -e '(b|B)(e|E)(e|E)(r|R)' file
      $ egrep -e '[bB][eE][eE][rR]' file
      $ egrep -i -e 'beer' file

    Which is easier to understand?
    
    Do these following expressions match exactly the same lines as the
    above expressions?

      $ egrep -e 'beer|BEER' file
      $ egrep -e '[beer][BEER]' file
      $ egrep -e '[beer]|[BEER]' file

*)  Are these following extended regular expression lines exactly equivalent?

      $ egrep -e 'a(b|c)d' file
      $ egrep -e '(ab|ac)d' file
      $ egrep -e 'a(bd|cd)' file
      $ egrep -e 'abd|acd' file

    Hint: Yes.  Concatenation and alternation of regular expressions obey
    rules similar to multiplication and addition of numbers in arithmetic:

        ARITHMETIC:  a*(b+c)*d = (a*b+a*c)*d = a*b*d+a*c*d
        REGEXP:      a(b|c)d   = (ab|ac)d    = abd|acd

    Think of concatenation as "multiply" and alternation as "add" to
    get the precedence rules correct.
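
    One way to check the equivalence by experiment is sketched here.
    The four sample lines are invented, and "grep -E" is used as the
    equivalent of "egrep".

```shell
# Two of these lines contain "abd" or "acd"; the other two do not.
input=$(printf 'xabdx\nacd\nabcd\nad\n')

# Run each form of the expression and remember which lines it selects.
r1=$(printf '%s\n' "$input" | grep -E -e 'a(b|c)d')
r2=$(printf '%s\n' "$input" | grep -E -e '(ab|ac)d')
r3=$(printf '%s\n' "$input" | grep -E -e 'a(bd|cd)')
r4=$(printf '%s\n' "$input" | grep -E -e 'abd|acd')

printf '%s\n' "$r1"
if [ "$r1" = "$r2" ] && [ "$r2" = "$r3" ] && [ "$r3" = "$r4" ]; then
    echo "all four forms selected the same lines"
fi
```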

*)  Are these following lines exactly equivalent?

      $ egrep 'labell?ed' file
      $ egrep 'label(l|)ed' file

    Can the "?" metacharacter always be replaced by a parenthesized
    expression using "|" with one empty alternative?

    Hint: Yes.  You never need to use "?" in an extended regular
    expression - it just makes some extended regular expressions shorter.
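
    A quick comparison on invented sample words is sketched below.  It
    assumes GNU grep: POSIX leaves an empty alternative such as "(l|)"
    undefined, so other versions of grep may reject the second pattern.

```shell
# "labellled" (three l's) and "label" should not be selected by either form.
input=$(printf 'labeled\nlabelled\nlabellled\nlabel\n')

with_q=$(printf '%s\n' "$input" | grep -E 'labell?ed')
with_alt=$(printf '%s\n' "$input" | grep -E 'label(l|)ed')

printf '%s\n' "$with_q"
if [ "$with_q" = "$with_alt" ]; then
    echo "both forms selected the same lines"
fi
```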

*)  Are these following lines exactly equivalent?

      $ egrep '0+' file
      $ egrep '00*' file

    Can the "+" metacharacter always be replaced by repeating the
    pattern and using "*" instead?

    Hint: Yes.  You never need to use "+" - it just makes some extended
    regular expressions shorter (sometimes a *lot* shorter!).
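
    The same kind of check works here.  This sketch uses invented sample
    lines and adds a second, hypothetical pair of patterns to show how
    the "*" form grows when the repeated pattern is longer.

```shell
input=$(printf 'x0y\n1000\nabc\nabcabc7\n')

# A single character: "0+" versus "00*".
plus=$(printf '%s\n' "$input" | grep -E '0+')
star=$(printf '%s\n' "$input" | grep -E '00*')

# A longer pattern: the whole group has to be written out twice.
gplus=$(printf '%s\n' "$input" | grep -E '(abc)+7')
gstar=$(printf '%s\n' "$input" | grep -E 'abc(abc)*7')

printf '%s\n' "$plus" "$gplus"
if [ "$plus" = "$star" ] && [ "$gplus" = "$gstar" ]; then
    echo "the + and * forms selected the same lines"
fi
```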

*)  Are these following lines exactly equivalent?

      $ egrep 'a*b*c*' file
      $ egrep '[abc]*' file
      $ egrep '(abc)*' file

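    Hint: The three patterns are not equivalent, but as written (with no
    anchors) each one can match the empty string, so each command selects
    every line.  Anchor each pattern with ^ and $ and then give a line
    that is matched by one but not the others.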
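
    A sketch for experimenting with these three patterns follows.  The
    sample lines are invented.  The -x option makes grep require the
    pattern to match the whole line; without it, each of these patterns
    can match the empty string and therefore selects every line.

```shell
input=$(printf 'aabcc\ncab\nabcabc\n')

# Unanchored: a pattern that can match nothing at all selects every line.
unanchored=$(printf '%s\n' "$input" | grep -E -c 'a*b*c*')

# Whole-line matches only (-x): now the three patterns behave differently.
m1=$(printf '%s\n' "$input" | grep -E -x 'a*b*c*')
m2=$(printf '%s\n' "$input" | grep -E -x '[abc]*')
m3=$(printf '%s\n' "$input" | grep -E -x '(abc)*')

echo "unanchored a*b*c* selected $unanchored lines"
echo "a*b*c* :" $m1
echo "[abc]* :" $m2
echo "(abc)* :" $m3
```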

*)  The following regular expressions give identical results when used by
    grep to select lines:

       $ grep '^a' /etc/passwd
       $ grep '^a.*' /etc/passwd
       $ grep '^a.*$' /etc/passwd

    Why do they give the same results?  Which one is fastest?
    Don't write complex regular expressions when simple ones will do.

    (Note that if the above patterns were used in a "sed" substitution,
    the patterns would match different things.)
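
    The difference under "sed" can be seen with a one-line sketch.  The
    sample line and the replacement text "X" are invented.

```shell
line='alpha:beta'

# grep only asks "does the line match?", so all three select the line.
g1=$(printf '%s\n' "$line" | grep -c '^a')
g2=$(printf '%s\n' "$line" | grep -c '^a.*')
g3=$(printf '%s\n' "$line" | grep -c '^a.*$')

# sed replaces exactly the text that matched, so the extent of the match shows.
s1=$(printf '%s\n' "$line" | sed 's/^a/X/')     # only the first letter matched
s2=$(printf '%s\n' "$line" | sed 's/^a.*/X/')   # the whole line matched
s3=$(printf '%s\n' "$line" | sed 's/^a.*$/X/')  # the whole line matched

echo "grep counts: $g1 $g2 $g3"
echo "sed results: $s1 | $s2 | $s3"
```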

*)  The following regular expressions give identical results when used by
    grep to select lines:

       $ grep 'a$' /etc/passwd
       $ grep '.*a$' /etc/passwd
       $ grep '^.*a$' /etc/passwd

    Why do they give the same results?  Which one is fastest?
    Don't write complex regular expressions when simple ones will do.

    (Note that if the above patterns were used in a "sed" substitution,
    the patterns would match different things.)

*)  Look for lines in the password file that contain four or more
    adjacent zeroes.  Use an option to display just the count of lines,
    not the lines themselves.  (Do not use "wc"; use an option to "grep".)

*)  Use an option to display just the file names of the header files in
    the /usr/include/ directory that contain the string "stdin".
    (Header files end in the two characters ".h".)  Don't display the
    matching lines, just the names of the files containing a match.
    (Answer: about 13 files, including /usr/include/stdio.h .)

*)  Repeat the above question, but use an option to grep that will do a
    case-insensitive match that will find "stdin", "STDIN", "sTdIn", etc.
    How does the list of files output differ from the previous question?
    (Hint: put both lists of files into temporary files and run "diff"
    to see the differences.)

*)  Use an option to display the count of words in /usr/share/dict/words
    that both begin and end with the lower-case letter 'a'.
    (Answer: 1433 words)

*)  Use an option to display the count of words in /usr/share/dict/words
    that both begin and end with the lower-case letter 'a' and also
    contain a third letter 'a' somewhere in the middle.  (Answer: 595 words.)

*)  Repeat the above question, but add an option to do a case-insensitive
    match.  (Answer: 1126 words.)

*)  Use options to display the count of words in /usr/share/dict/words
    that both begin and end with the letter 'a' and also
    contain a third and a fourth letter 'a' somewhere in the middle.
    Do a case-sensitive match.  (Answer: 100 words.)
    Do a case-insensitive match.  (Answer: 191 words.)

*)  Use grep to select words from the file /usr/share/dict/words that
    have all the vowels in ascending order, "a" before "e" before "i"
    before "o" before "u", with any number of other characters in
    between.  (Answer: 247 or 250 words depending on case sensitivity.)

*)  Use grep to select words from the file /usr/share/dict/words that
    have all the individual letters in the name "elvis" in the same order,
    "e" before "l" before "v" before "i" before "s", with any number
    of other characters in between the letters.  (Answer: 134 or 135 words.
    The longest one is "pneumonoultramicroscopicsilicovolcanoconiosis".)

*)  Find which header files in the /usr/include/ directory contain the
    string "FILE".  (Header files end in the two characters ".h".)  Don't
    display the matching lines, just the names of the files containing
    a match.  (Answer: about 58 files, including /usr/include/stdio.h .)

*)  Repeat the above question, but use an option to grep to match only
    the *word* "FILE", not the string FILE.  (Answer: about 30 files.)

*)  Repeat the above question, but match the word "printf".
    (Answer: about 9 files, including /usr/include/error.h .)

*)  These mean different things:
      1. Display lines that contain a character that is not the letter 'a'
      2. Display lines that do not contain the letter 'a'
    Give an example of a line that one matches but the other does not.
    How long is the shortest line output by each command?

*)  Do these command lines always give the same output?
      1. grep '[^a]'
      2. grep -v 'a'
    If they differ, give an example of a line that one matches but the
    other does not.  How long is the shortest line output by each command?
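
    If you want to experiment, a scratch file such as the one in this
    sketch makes the two commands easy to compare.  The sample lines are
    invented; the last line of the file is empty.

```shell
sample=$(mktemp)
printf 'aaa\nxay\nabc\nbbb\n\n' >"$sample"

echo "Lines containing a character that is not 'a':"
grep '[^a]' "$sample"
echo "Lines that do not contain 'a':"
grep -v 'a' "$sample"

# Count the lines selected by each command.
has_non_a=$(grep -c '[^a]' "$sample")
without_a=$(grep -c -v 'a' "$sample")
echo "counts: $has_non_a and $without_a"
rm -f "$sample"
```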

*)  Do these command lines always give the same output?
      1. grep '[^d][^o][^g]'
      2. grep -v 'dog'
    If they differ, give an example of a line that one matches but the
    other does not.  How long is the shortest line output by each command?

*)  How many lines in /usr/include/stdio.h do *not* contain any characters?
    (Note: A line with "no characters" still ends in a newline!)
    You can answer this two ways:
      1. How many lines have the end of the line right after the start?
      2. If you exclude all lines that contain any single character, how
         many lines are left over (count the non-matching lines)?
    Derive grep expressions to produce both answers.  One expression
    will probably use an option to grep to "invert" the match and
    select only non-matching lines.  (Answer: 181 lines)

*)  How many lines in /usr/include/stdio.h do *not* contain any blanks?
    You can answer this two ways:
      1. How many lines contain only zero or more non-blank characters?
      2. If you exclude all lines that contain a blank character, how
         many lines are left over (count the non-matching lines)?
    Derive grep expressions to produce both answers.  One expression
    will probably use an option to grep to "invert" the match and
    select only non-matching lines.  (Answer: 272 lines)

*)  How many lines in /usr/include/stdio.h do *not* contain any upper-
    or lower-case letters?
    You can answer this two ways:
      1. How many lines contain only zero or more non-letter characters?
      2. If you exclude all lines that contain a letter, how
         many lines are left over (count the non-matching lines)?
    Derive grep expressions to produce both answers.  One expression
    will probably use an option to grep to "invert" the match and
    select only non-matching lines.  (Answer: 183)
    (Time-saver: use a case-insensitive match.)

*)  The directory /usr/include/ is where the C language keeps its standard
    header files on Unix, e.g. #include <stdio.h> refers to the file
    "/usr/include/stdio.h" and #include <sys/cdefs.h> refers to the file
    /usr/include/sys/cdefs.h.
    
    The file errno.h in the /usr/include directory contains the #define
    statements for Unix errors.  Find the #define statement that defines
    the Unix "EPERM" error ("Operation not permitted").

    Problem:
    Unfortunately, include files often contain other #include directives
    that include other files (that themselves contain #include directives
    of other files...), so you often can't find what you want by doing:
        grep -w EPERM /usr/include/errno.h    # no results!
    File errno.h includes other include files, and one of those other
    include files must contain the actual EPERM definition.

    Solution:
    Use a command and regexp to look for *both* EPERM *and* "include"
    lines (at the same time) in /usr/include/errno.h, then repeat and
    look for both strings in any #include files found.  If you don't
    find the definition there, keep repeating on all the #include
    file names in those included files and repeat the process, until
    you finally find the actual file containing the EPERM definition.
    (Manually follow the chain of #include directives.)
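
    The search step can be sketched on a pair of invented header files.
    Everything here is hypothetical: the symbol EDEMO, its value, and the
    files top.h and inner.h only stand in for the real files that you
    must follow yourself.

```shell
dir=$(mktemp -d)
printf '#include "inner.h"\nextern int demo;\n' >"$dir/top.h"
printf '/* demo errors */\n#define EDEMO 42\n' >"$dir/inner.h"

# Two -e options look for either word in one pass over the file.
step1=$(grep -w -e 'EDEMO' -e 'include' "$dir/top.h")
echo "top.h:   $step1"      # no definition here, only an #include to follow

# Repeat the same search on the file named in the #include line.
step2=$(grep -w -e 'EDEMO' -e 'include' "$dir/inner.h")
echo "inner.h: $step2"      # the definition itself
rm -rf "$dir"
```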

    What actual file contains the definition of EPERM?  What is
    the value of EPERM?  Use a grep command line to count how many
    #define statements are in this file (about 125).  Modify the grep
    expression to count *only* the define statements that define numeric
    error numbers.  (Count only lines that have #define followed by any
    number of any character followed by a number preceded by a whitespace
    character [blank or tab].  You can use the POSIX bracketed [:space:]
    and [:digit:] character classes here.)  (Answer: 122 lines)

*)  Write a small script to display just the line number of the first
    line on which a pattern is found in a file.  Use this syntax:

        $0 pattern filename

    Examples:

        $ ./myline 'struct' /usr/include/stdio.h
        45

        $ ./myline 'errlist' /usr/include/stdio.h
        554

    Hints: Use grep to find the pattern in the file and use a grep option
    to output the line number along with the line.  Use a common Unix
    command to select just the *first* line of grep output.  Split the
    line number off from the beginning of this line and display just
    the number.  (See the data_mining.txt file under Notes for techniques
    of splitting lines to get at fields.  [Hint hint: use awk with the
    '-F:' option!])  Use pipes to connect all your commands - do not
    save output in temporary files!  Your final script will probably
    contain three Unix commands in the pipeline, starting with grep.

    Validate your inputs before you use them in the script.  (Check for
    missing arguments; make sure the filename is readable, etc.)
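
    One possible shape for such a pipeline is sketched below as a shell
    function, so that it can be tried without creating a script file.
    The function name, the messages and the sample file are assumptions
    made for this sketch; in a real script the body would stand alone
    and use "$0" in its usage message.

```shell
myline () {
    # Validate the inputs: exactly two arguments, and a readable file.
    if [ $# -ne 2 ]; then
        echo "Usage: myline pattern filename" >&2
        return 2
    fi
    if [ ! -f "$2" ] || [ ! -r "$2" ]; then
        echo "myline: cannot read file: $2" >&2
        return 2
    fi
    # grep -n prefixes each matching line with "NUMBER:";
    # head keeps only the first matching line;
    # awk prints the field in front of the first colon.
    grep -n -e "$1" -- "$2" | head -n 1 | awk -F: '{ print $1 }'
}

# Try it on a small invented file: "struct" first appears on line 2.
sample=$(mktemp)
printf 'one\ntwo struct\nthree struct\n' >"$sample"
first=$(myline 'struct' "$sample")
echo "$first"
rm -f "$sample"
```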