===================================================================
Using commands and pipes to "mine" and extract data from the system
===================================================================
-Ian! D. Allen - idallen@idallen.ca - www.idallen.com

Because of the power of Unix pipes and the rich set of command-line
tools available, Unix programmers are often asked to extract or "mine"
data from various text files.

The "mining" operation can take many forms; but, a common form is to
process a stream of text and extract certain fields from certain lines.
One set of commands selects the lines to extract; the other set of
commands picks off the desired fields from those lines (or vice-versa).
Often these two operations are repeated, narrowing down the selection
until just the desired information is displayed.

Data mining is easy, if you build up the Unix pipeline slowly, adding
one command at a time and watching the output each time.

Some Unix commands select lines from a text stream, others select
fields, and some can do both:

Select lines from text streams:
    grep, awk, sed, head, tail, look, uniq, comm, diff
Select fields in lines or parts of lines:
    awk, sed, cut
Transform text (change characters or words in lines):
    awk, sed, tr

The "sort" command is also useful for putting lines of text in order.

Become familiar with the data mining capabilities of the above commands.

---------
Example 1
---------

Problem: "Print the fifth directory from your $PATH environment variable."

We will do an iterative solution built up slowly using simple commands.

First, we echo the PATH variable onto our screen:

    $ echo "$PATH"

Next, we convert the colons separating directories into newlines, so
that each directory is on a separate line.
We do this so that we can later use "line selection" commands to select
the fifth directory:

    $ echo "$PATH" | tr ':' '\n'

Now, we use a "line selection" command to select the first five lines:

    $ echo "$PATH" | tr ':' '\n' | head -5

Now, we use a "line selection" command to select the last line (of 5):

    $ echo "$PATH" | tr ':' '\n' | head -5 | tail -1

This is the answer - it is the fifth directory (the last line of the
first five lines).

We can also do the same operation using the "field selection" commands
to extract the fifth field. By default, "awk" separates fields by
blanks; so, we need to turn the colons in PATH into blanks:

    $ echo "$PATH" | tr ':' ' ' | awk '{print $5}'

However, "awk" has an option to use another separator character:

    $ echo "$PATH" | awk -F: '{print $5}'

---------
Example 2
---------

Problem: "Print the second-to-last directory from your $PATH environment
variable."

Use the same basic line-oriented form as the previous example, only
select the lines from the end of the list instead of the beginning.
Build up the command one-by-one:

    $ echo "$PATH"
    $ echo "$PATH" | tr ':' '\n'
    $ echo "$PATH" | tr ':' '\n' | tail -2
    $ echo "$PATH" | tr ':' '\n' | tail -2 | head -1

This is the answer - it is the second-to-last directory (the first line
of the last two lines).

We can also do the same operation using the "field selection" commands
to extract the second-to-last field. In "awk", NF is the number of
fields in the line, so $(NF-1) is the field before the last one:

    $ echo "$PATH" | tr ':' ' ' | awk '{print $(NF-1)}'

Or:

    $ echo "$PATH" | awk -F: '{print $(NF-1)}'

Note the use of single quotes to protect the dollar signs in the awk
script fragment from expansion by the shell.

---------
Example 3
---------

Problem: "Sort the elements in the PATH variable in ascending order."

Since the "sort" command only works on lines, not fields, we must first
transform the PATH into a list of directories, one per line:

    $ echo "$PATH"
    $ echo "$PATH" | tr ':' '\n'

Now, we can add the sort command:

    $ echo "$PATH" | tr ':' '\n' | sort

Now, we can put the line back together by changing all the newlines
back into colons:

    $ echo "$PATH" | tr ':' '\n' | sort | tr '\n' ':'

The above line adds an extra ":" on the end of the $PATH, which isn't
correct. To get rid of the final colon:

    $ echo "$PATH" | tr ':' '\n' | sort | tr '\n' ':' \
        | sed -e 's/:$//'

---------
Example 4
---------

Problem: "Keep only the first five elements of the PATH."

We will again transform the fields of PATH into directories on separate
lines, select the first five lines, then put the directories back
together again:

    $ echo "$PATH" | tr ':' '\n' | head -5 | tr '\n' ':' \
        | sed -e 's/:$//'

Make sure to get rid of the trailing colon added by the final newline.

---------
Example 5
---------

Problem: "How many unique shells are in the /etc/passwd file?"

Build up the solution iteratively, starting with simple commands.

The shell is the seventh colon-delimited field in the passwd file.
The commands "awk", "sed", or "cut" can pick out a field from a file.
We will use "cut" to pick out the 7th field delimited by a colon.

Once we have only the 7th field being output, we can use "sort" and
"uniq" to reduce the repeated lines to only unique lines, and then
count them.

Because the /etc/passwd file on ACADUNIX is huge (and the output on our
screen would be huge), we will start making our pipeline with only the
first 10 lines of the passwd file until we know we have the correct
command line, then we will use the solution on the whole passwd file.
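
The reason "sort" comes before "uniq" in this plan is that "uniq" only
collapses repeated lines that are adjacent to each other. The following
is a minimal sketch of that behaviour; the list of shells is made-up
sample data standing in for the output of "cut", not the contents of any
real passwd file, and the variable names are only for illustration:

```shell
# Hypothetical seventh-field values, as "cut" might print them (not real data).
shells='/bin/bash
/bin/sh
/bin/bash
/bin/sh
/bin/bash'

# Without sort, no two equal lines are adjacent, so uniq removes nothing.
unsorted_count=$( printf '%s\n' "$shells" | uniq | wc -l )

# With sort, equal lines become adjacent and uniq keeps one of each.
sorted_count=$( printf '%s\n' "$shells" | sort | uniq | wc -l )

echo "uniq alone: $unsorted_count   sort | uniq: $sorted_count"
```

With this sample, "uniq" alone still reports five lines, while "sort"
followed by "uniq" reports two.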

First, get 10 lines from the top of the passwd file:

    $ head /etc/passwd

Cut out only the seventh field in each line, delimited by a colon:

    $ head /etc/passwd | cut -d : -f 7

Sort the fields:

    $ head /etc/passwd | cut -d : -f 7 | sort

Reduce the output to unique lines:

    $ head /etc/passwd | cut -d : -f 7 | sort | uniq

Count the unique lines:

    $ head /etc/passwd | cut -d : -f 7 | sort | uniq | wc -l

We have the correct command line. Now use the solution on the whole
file:

    $ cat /etc/passwd | cut -d : -f 7 | sort | uniq | wc -l

Note that the "cut" command is quite capable of reading files itself -
there is no need to use a superfluous "cat" command to do it:

    $ cut -d : -f 7 /etc/passwd | sort | uniq | wc -l

The sort command has an option that outputs only unique lines. If we
knew about it, we would write:

    $ cut -d : -f 7 /etc/passwd | sort -u | wc -l

Does the pipeline below (the reverse of the above) give the same output?

    $ sort -u /etc/passwd | cut -d : -f 7 | wc -l

When selecting lines and fields from a text stream, often the order in
which you do the selection matters.

---------
Example 6
---------

The "cut" command treats every occurrence of the delimiter as the
beginning of a new field. This makes it awkward to use in many
situations. For example, you might try to use "cut" to extract the
current day from the date string (though there are easier ways to get
this information):

    $ date=$( date )
    $ echo "The date is $date"
    The date is Wed Oct 16 13:51:54 EDT 2002
    $ echo "$date" | cut -d ' ' -f 3
    16

This looks like it's working fine, until next month...

    $ date=$( date )
    $ echo "The date is $date"
    The date is Fri Nov  1 12:15:45 EDT 2002
    $ echo "$date" | cut -d ' ' -f 3

    $

Whoops! The extra blank in front of the day " 1" has caused "cut" to
come up with an empty third field. This is not what we want.

The "awk" command behaves more reasonably.
By default, "awk" splits up lines on any non-zero amount of whitespace
(blanks and tabs), so "awk" does not get confused by the extra blank:

    $ date=$( date )
    $ echo "The date is $date"
    The date is Fri Nov  1 12:15:45 EDT 2002
    $ echo "$date" | awk '{print $3}'
    1

This works much better. "awk" doesn't care if there is one blank or
many blanks; it still divides the line up into the same number of
fields.
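
The difference between the two commands can be reproduced on any day of
the month by using a fixed string instead of the live output of "date".
The sketch below assumes a sample string with two blanks before the day,
copied from the output shown above; the variable names are only for
illustration. It also shows one possible workaround that is not part of
the examples above: squeezing runs of blanks with "tr -s" before handing
the line to "cut".

```shell
# Fixed sample (an assumption mirroring the date output above), with two
# blanks in front of the single-digit day.
sample='Fri Nov  1 12:15:45 EDT 2002'

# cut treats every blank as a delimiter, so field 3 is the empty string
# between the two blanks.
cut_day=$( echo "$sample" | cut -d ' ' -f 3 )

# awk splits on runs of blanks, so field 3 is the day.
awk_day=$( echo "$sample" | awk '{print $3}' )

# Squeezing repeated blanks into one with tr -s lets cut find the day too.
squeezed_day=$( echo "$sample" | tr -s ' ' | cut -d ' ' -f 3 )

echo "cut: '$cut_day'  awk: '$awk_day'  tr -s + cut: '$squeezed_day'"
```

Here "cut" alone produces an empty string, while both "awk" and the
"tr -s" variant produce "1".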