===================================================================
Using commands and pipes to "mine" and extract data from the system
===================================================================
-Ian! D. Allen - idallen@idallen.ca - www.idallen.com

Because of the power of Unix pipes and the rich set of command-line
tools available, Unix programmers are often asked to extract or "mine"
data from various text files, or to convert files from one format to
another format.

The "mining" operation can take many forms, but a common form is to
process a stream of text and extract certain fields from certain lines.
One set of commands selects the lines to extract; the other set of
commands picks off the desired fields from those lines (or vice-versa).
Often these two operations are repeated, narrowing down the selection
until just the desired information is displayed.

Data mining is easy if you build up the Unix pipeline slowly, adding
one command at a time and watching the output each time.

Some Unix commands select lines from a text stream, others select
fields, and some can do both:

    Select lines from text streams:
        grep, awk, sed, head, tail, look, uniq, comm, diff
    Select fields in lines or parts of lines:
        awk, sed, cut
    Transform text (change characters or words in lines):
        awk, sed, tr

The "sort" command is also useful for putting lines of text in order.
Become familiar with the data mining capabilities of the above commands.

------------------
Tricks and Recipes
------------------

* Extract the lines between two patterns:

    ... | grep -A 1000 "start pattern" | grep -B 1000 "end pattern"
    # The end pattern must be within 1000 lines of the start pattern
    # in the input stream.

  -OR-

    ... | sed -n -e '/start pattern/,/end pattern/p'
    # This solution uses the "stream editor".

---------
Example 1
---------

Problem: "Print the fifth directory from your $PATH environment variable."

This problem can be generalized to print the fifth field of any input
stream.  We will do an iterative solution built up slowly using simple
commands.

First, we echo the PATH variable onto our screen:

    $ echo "$PATH"

Next, we convert the colons separating directories into newlines, so
that each directory is on a separate line.  We do this so that we can
later use "line selection" commands to select the fifth directory:

    $ echo "$PATH" | tr ':' '\n'

Now, we use a "line selection" command to select the first five lines:

    $ echo "$PATH" | tr ':' '\n' | head -5

Now, we use a "line selection" command to select the last line (of 5):

    $ echo "$PATH" | tr ':' '\n' | head -5 | tail -1

This is the answer - it is the fifth directory (the last line of the
first five lines).

Another solution:
-----------------

We can also do the same operation using the "field selection" commands
to extract the fifth field.  By default, "awk" separates fields by
blanks, so we need to turn the colons in PATH into blanks.  Again,
build up the command iteratively:

    $ echo "$PATH"
    $ echo "$PATH" | tr ':' ' '
    $ echo "$PATH" | tr ':' ' ' | awk '{print $5}'

However, "awk" has a convenient option to use another separator
character:

    $ echo "$PATH" | awk -F: '{print $5}'

Even "sed" lets us pick off the fifth field separated by colons using a
"regular expression" pattern (though this is very messy!):

    $ echo "$PATH" | sed -e 's/^[^:]*:[^:]*:[^:]*:[^:]*:\([^:]*\):.*/\1/'
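A "cut" solution is also possible.  This is only a sketch; it assumes
the directories in PATH are separated by exactly one colon each (see
the caution about "cut" at the end of these notes):

    $ echo "$PATH" | cut -d : -f 5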
---------
Example 2
---------

Problem: "Print the second-to-last directory from your $PATH
environment variable."

This can be generalized to print the second-to-last line of any input.

Use the same basic line-oriented form as the previous example, only
select lines from the end of the list instead of the beginning.  Build
up the command one-by-one:

    $ echo "$PATH"
    $ echo "$PATH" | tr ':' '\n'
    $ echo "$PATH" | tr ':' '\n' | tail -2
    $ echo "$PATH" | tr ':' '\n' | tail -2 | head -1

This is the answer - it is the second-to-last directory (the first line
of the last two lines).

We can also do the same operation using the "field selection" commands
to extract the second-to-last field.  Build up the pipeline iteratively:

    $ echo "$PATH"
    $ echo "$PATH" | tr ':' ' '
    $ echo "$PATH" | tr ':' ' ' | awk '{print $(NF-1)}'

Or:

    $ echo "$PATH" | awk -F: '{print $(NF-1)}'

Or:

    $ echo "$PATH" | sed -e 's/^.*:\([^:]*\):[^:]*$/\1/'

Note the use of single quotes to protect the dollar signs in the
command-line script fragments from expansion by the shell.  We want the
commands themselves to see the dollar signs; we don't want the shell to
expand them.

---------
Example 3
---------

Problem: "Sort the elements in the PATH variable in ascending order."

This can be generalized to sort any field-delimited list.

Since the "sort" command only works on lines, not fields, we must first
transform the PATH into a list of directories, one per line:

    $ echo "$PATH"
    $ echo "$PATH" | tr ':' '\n'

Now, we can add the sort command:

    $ echo "$PATH" | tr ':' '\n' | sort

Now, we can put the line back together by changing all the newlines
back into colons:

    $ echo "$PATH" | tr ':' '\n' | sort | tr '\n' ':'

The above pipeline adds an extra ":" on the end of the sorted PATH
(the final newline also becomes a colon), which isn't correct.  To get
rid of the final colon:

    $ echo "$PATH" | tr ':' '\n' | sort | tr '\n' ':' | sed -e 's/:$//'

---------
Example 4
---------

Problem: "Keep only the first five elements of the PATH."

This can be generalized to keep the first N fields of any list.

We will again transform the fields of PATH into directories on separate
lines, select the first five lines, then put the directories back
together again.  Build up the pipeline iteratively, checking each step:

    $ echo "$PATH"
    $ echo "$PATH" | tr ':' '\n'
    $ echo "$PATH" | tr ':' '\n' | head -5
    $ echo "$PATH" | tr ':' '\n' | head -5 | tr '\n' ':'
    $ echo "$PATH" | tr ':' '\n' | head -5 | tr '\n' ':' | sed -e 's/:$//'

Make sure to get rid of the trailing colon added by the final newline.
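As an aside, "cut" can select a range of fields directly, and it keeps
the delimiters between the fields it selects, so this sketch (again
assuming single-colon separators) gives the same result with no
trailing colon to remove:

    $ echo "$PATH" | cut -d : -f 1-5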
---------
Example 5
---------

Problem: "How many unique shells are in the /etc/passwd file?"

This can be generalized to output the unique fields in any data.

Build up the solution iteratively, starting with simple commands.  The
shell is the seventh colon-delimited field in the passwd file.  The
commands "awk", "sed", or "cut" can pick out a field from a file.  We
will use "cut" to pick out the 7th field delimited by a colon.  Once we
have only the 7th field being output, we can use "sort" and "uniq" to
reduce the repeated lines to only unique lines, and then count them.

Because the /etc/passwd file on some machines is huge (and the output
on our screen would be huge), we will start building our pipeline with
only the first 10 lines of the passwd file until we know we have the
correct command line; then we will use the solution on the whole
passwd file.

First, get 10 lines from the top of the passwd file:

    $ head /etc/passwd

Cut out only the seventh field in each line, delimited by a colon:

    $ head /etc/passwd | cut -d : -f 7

Sort the fields:

    $ head /etc/passwd | cut -d : -f 7 | sort

Reduce the output to unique lines:

    $ head /etc/passwd | cut -d : -f 7 | sort | uniq

Count the unique lines:

    $ head /etc/passwd | cut -d : -f 7 | sort | uniq | wc -l

We have the correct command line.  Now use the solution on the whole
file:

    $ cat /etc/passwd | cut -d : -f 7 | sort | uniq | wc -l

Note that the "cut" command is quite capable of reading files itself -
there is no need to use a superfluous and unnecessary "cat" command to
do it:

    $ cut -d : -f 7 /etc/passwd | sort | uniq | wc -l

The sort command also has an option that only outputs unique lines.
If we knew about it, we would write:

    $ cut -d : -f 7 /etc/passwd | sort -u | wc -l

Does the pipeline below (the reverse of the above) give the same
output?

    $ sort -u /etc/passwd | cut -d : -f 7 | wc -l

When selecting lines and fields from a text stream, often the order in
which you do the selection matters.

---------
Example 6
---------

Problem: Extract the temperature for Ottawa from a weather web page.

Here is the web page that has the Ottawa weather:

    http://text.weatheroffice.gc.ca/forecast/city_e.html?on-118

We look at both the unformatted (wget) and formatted (lynx) versions of
the page, and decide which version is easiest to work with:

    $ url='http://text.weatheroffice.gc.ca/forecast/city_e.html?on-118'
    $ wget "$url"
    ...raw HTML page downloads here...
    $ lynx -dump "$url" >out.txt
    ...formatted web page is in out.txt...

The formatted page is probably easiest to work with.  The temperature
we want is in this section of the page:

    [...]
    Condition:
    Sunny
    Temperature:
    -18.4°C
    Pressure / Tendency:
    102.0 kPa / rising
    [...]

The temperature we want to extract is on the line following the word
"Temperature:" in the file.  The grep command has an option to print
not only the line containing the pattern we want, but also any number
of lines following that line:

    $ lynx -dump "$url" | grep -A 1 "Temperature:"
    Temperature:
    -18.4°C

Sometimes the pattern we want matches more than one line in the weather
report, and we might get multiple pairs of output lines:

    $ lynx -dump "$url" | grep -A 1 -i "Wind"
    [31]Wind Chill:
    -26
    --
    Wind Speed:
    NW 13 km/h
    --
    Sunny. Wind up to 15 km/h. High minus 12. Wind chill minus 32 this
    morning.

We can limit the output to just the first pair of matching lines:

    $ lynx -dump "$url" | grep -A 1 -i "Wind" | head -2
    [31]Wind Chill:
    -26

We will add that safety "head -2" to our pipeline, just to be sure we
only find the first pair of matching lines and not others that might
appear in the weather report on other days.

From that pair of output lines, we only want the last line of the two:

    $ lynx -dump "$url" | grep -A 1 "Temperature:" | head -2 | tail -1
    -18.4°C

In a script, we would place the output into a variable for later use:

    temp=$( lynx -dump "$url" | grep -A 1 "Temperature:" | head -2 | tail -1 )
    [...]
    echo "The temperature in $city is: $temp"
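As an aside, "awk" can print the line that follows a pattern match all
by itself, avoiding the separate "head" and "tail" steps.  This is only
a sketch of the same idea:

    $ lynx -dump "$url" | awk '/Temperature:/ { getline; print; exit }'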
------------------------------
Re-formatting data for parsing
------------------------------

Sometimes the data you are reading isn't nicely separated into lines on
which you can use "grep".  You can use the tr ("translate") command to
split up the file into separate lines, based on some list of
delimiters.  For parsing HTML pages, it's often useful to split the
long lines on angle brackets and/or quotes.  This can put the data you
want to extract on separate lines so that grep can find it easily.

Let's try to extract the URL that has the Ottawa weather from the
master index page of weather in major Canadian cities:

    $ url='http://text.weatheroffice.gc.ca/canada_e.html'
    $ wget -q -O - "$url" | grep Ottawa
    <td headers="header1"><a href='/forecast/city_e.html?on-118&amp;unit=m'>Ottawa (Kanata - Orléans)</a></td><td class="t-left" headers="header2">Sunny</td><td class="t-right noWrap" headers="header3">-18&deg;C</td></tr>

That very long line is messy for getting at the URL for Ottawa, which
is inside single quotes in the middle of the line.  Let's split that
one long line into separate lines by turning angle brackets into
newlines:

    $ wget -q -O - "$url" | grep Ottawa | tr -s '<>' '\n'
    td headers="header1"
    a href='/forecast/city_e.html?on-118&amp;unit=m'
    Ottawa (Kanata - Orléans)
    /a
    /td
    td class="t-left" headers="header2"
    Sunny
    /td
    td class="t-right noWrap" headers="header3"
    -18&deg;C
    /td
    /tr

We want that /forecast/... URL without the single quotes and ampersand;
so, we also split the input on single quotes and ampersands:

    $ wget -q -O - "$url" | grep Ottawa | tr -s "<>'&" '\n'
    td headers="header1"
    a href=
    /forecast/city_e.html?on-118
    amp;unit=m
    Ottawa (Kanata - Orléans)
    /a
    /td
    td class="t-left" headers="header2"
    Sunny
    /td
    td class="t-right noWrap" headers="header3"
    -18
    deg;C
    /td
    /tr

Now, with the URL on its own line, it's easy to use grep to select just
the URL part of the line:

    $ wget -q -O - "$url" | grep Ottawa | tr -s "<>'&" '\n' | grep "^/forecast"
    /forecast/city_e.html?on-118

In a script, we usually save the output in a variable using shell
Command Substitution.  Later, we output an HTML hyperlink to the Ottawa
weather page:

    wanturl=$( wget -q -O - "$url" | grep Ottawa | tr -s "<>'&" '\n' \
        | grep "^/forecast" )
    [...]
    fullURL="http://text.weatheroffice.gc.ca$wanturl"
    echo "<a href=\"$fullURL\">Link to Weather in Ottawa</a>"

----------------------------------------------
Using cut vs. awk to extract fields from lines
----------------------------------------------

The "cut" command treats every occurrence of the delimiter as the
beginning of a new field.  This makes it the wrong tool to use in many
situations.

    $ echo "a b c" | awk '{print $2}'     # output is: b
    $ echo "a b c" | cut -d ' ' -f 2      # output is: b
    $ echo "a b c" | cut -d ' ' -f 3      # output is: c

    $ echo "a  b c" | awk '{print $2}'    # output is: b
    $ echo "a  b c" | cut -d ' ' -f 2     # output is:    # WHY?
    $ echo "a  b c" | cut -d ' ' -f 3     # output is: b  # WHY?

    $ echo "a   b c" | awk '{print $2}'   # output is: b
    $ echo "a   b c" | cut -d ' ' -f 2    # output is:    # WHY?
    $ echo "a   b c" | cut -d ' ' -f 3    # output is:    # WHY?
    $ echo "a   b c" | cut -d ' ' -f 4    # output is: b  # WHY?

(Note the differing numbers of blanks between "a" and "b" above.)

For example, you might try to use "cut" to extract the current day from
the date string (though there are easier ways to get this information):

    $ date=$( date )
    $ echo "The date is $date"
    The date is Wed Oct 16 13:51:54 EDT 2002
    $ echo "$date" | cut -d ' ' -f 3
    16                                    # just the day number

This looks like it's working fine, until next month...

    $ date=$( date )
    $ echo "The date is $date"
    The date is Fri Nov  1 12:15:45 EDT 2002
    $ echo "$date" | cut -d ' ' -f 3      # empty!  WHY?

    $

Whoops!  The extra blank in front of the day " 1" has caused "cut" to
come up with an empty third field.  This is not what we want.

The "awk" command behaves more reasonably.  By default, "awk" splits up
lines on any non-zero amount of whitespace (blanks and tabs), so "awk"
does not get confused by the extra blank:

    $ date=$( date )
    $ echo "The date is $date"
    The date is Fri Nov  1 12:15:45 EDT 2002
    $ echo "$date" | awk '{print $3}'
    1

This works much better.  "awk" doesn't care if there is one blank or
many blanks; it still divides the line up into the same number of
fields.  Usually "awk" does what your eyes expect when extracting
fields separated by blanks.
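As for the "easier ways" mentioned above: for this particular task the
date command can print just the day of the month by itself, with no
pipeline at all; a minimal sketch:

    $ date +%d        # day of the month only, zero-padded (e.g. 01)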
"awk" doesn't care if there is one blank or many blanks; it still divides the line up into the same number of fields. Usually "awk" does what your eyes expect when extracting fields separated by blanks. Avoid "cut" for extracting fields unless you know that only one delimiter exists between fields.