-----------------------
Exercise #6 for CST8129 due October 19, 2005
-----------------------
-Ian! D. Allen - idallen@idallen.ca

Remember - knowing how to find out an answer is more important than
memorizing the answer.  Learn to fish!  RTFM!  (Read The Fine Manual)

Global weight: 3% of your total mark this term
Due date: Before the end of your Lab period on Wednesday, October 19.

The deliverables for this exercise are to be submitted online via the
T127 Linux Lab using the submit method described in the "Submission"
section, below.  No paper; no email; no FTP.

Late-submission date: I will accept without penalty online exercises that
are submitted late but before 17h00 (5pm) on Thursday, October 20.
After that late-submission date, the exercise is worth zero marks.

Exercises submitted by the *due date* will be marked online and your
marks will be sent to you by email after the late-submission date.

This exercise is due before the end of your Lab period on October 19.

Exercise Synopsis:

Marks: 3%

    Write a shell script to data-mine job listings.
    Write a shell script to sort three integers.

Where to work:
    Do your Unix command line work on any WT127 workstation.  (You may
    login to the workstation remotely.)  The files you work on will
    remain in your account after you log off.  Do not erase your files
    after submission; always keep a spare copy of your exercises.

    WARNING: Do not attempt this exercise on a Windows machine - the text
    file format is different.  You must connect to and work on Unix/Linux.
    Note that you may connect to a lab workstation *from* a Windows
    machine (using PuTTY); however, you may not use the Windows machine
    itself to do your work.  Use the vim editor on the Linux machine.

Location of the course notes on the Lab workstations:
    You can find a copy of all the course Notes files on any Lab
    workstation under directory:
       ~alleni/public_html/teaching/cst8129/05f/notes/
    You can copy files from this directory to your own account for
    modification or study, if you like.  (To avoid plagiarism charges,
    you must credit any material that you copy and submit unchanged as
    your own work.)

Location of the textbook CDROM files on the Lab workstations:
    The CDROM files for the Quigley textbook are available in the
    WT127 Lab under the directory:  /home/cst8129/

Exercise Preparation:

A.  Know where to find an online copy of all the course Notes on the
    Lab workstations.  (See above.)  You can get a copy of this
    exercise from the course notes.

B.  Complete the online Course Notes readings.
    Any questions?  See me in a lab or post questions to the Discussion
    news group (on the top left of the Course Home Page).

---------------------------------------------
Part I Exercise Details (in the T127 Linux Lab)
---------------------------------------------

The USENET news group "ott.jobs" contains tens of thousands of local
job listings.  Today (Oct 19), I obtained a summary of these postings,
with one line per posting.  Your job is to answer the question: What words
are most used in the "Subject" lines of these job postings?  Presumably,
the most used keywords are the most-needed skills that you should have.

File Format
-----------

    The summary file I obtained contains 9 tab-separated fields.
    The "Subject" field is the second field in each line.
    The file is compressed.

In the steps below, you will write a script to generate a sorted list of
keywords from the subject lines, with the most used words at the top
and the least used words at the bottom.

1.  Create a file named "exercise06script1.sh" with these five lines in it:

    #!/bin/sh -u
    PATH=/bin:/usr/bin ; export PATH
    LC_COLLATE=C ; export LC_COLLATE
    LANG=C ; export LANG
    umask 022

    The lines must be at the left margin, with no leading or trailing blanks
    or blank lines.  The word count and checksum of the resulting file will be: 

    $ wc exercise06script1.sh
      5 16 110 exercise06script1.sh

    $ sum exercise06script1.sh 
    56814     1

2.  Make the file executable:   chmod +x exercise06script1.sh
    Make sure the file executes without errors:   ./exercise06script1.sh
    (There will be no output from the file yet.)

3.  Add your Assignment Label to the file as comment lines, below the
    /bin/sh line and above the PATH line.  Make sure the first line of 
    the script remains the shell interpreter line (as given above).

4.  Copy the block of questions below into the end of the script file and
    add octothorpe comment characters ("#") in front of all the lines.

Your file will now be in these sections, in this exact order:

    - shell interpreter line (comment)
    - Assignment Label (comments)
    - set PATH, LC_COLLATE, LANG, and umask
    - Questions 5 and up (comments)

Execute the script file and make sure there are no errors and no output.
(You only added comment lines - the file should produce no output.)

Under each numbered question below, add commands, one by one, that will do
the steps below, in order.  You must make sure each command works at the
command line before you copy it into the script file and then test it
by executing the file.  (Hint: Work with two or three shell windows open.)

Do not create any extra temporary files (other than the files explicitly
named in this exercise).  Use pipes to connect commands, not files.
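
As a small illustration of the "pipes, not files" rule, the sketch below
counts the distinct lines in some sample text twice: once through a
temporary file and once through a pipe.  The sample text, the scratch
directory, and all the variable names are invented for this illustration;
they are not part of the exercise.

```shell
# Illustration only: the sample text and variable names are made up.
sample='pear
apple
pear
fig'

# With a temporary file (the style this exercise asks you to avoid):
workdir=$(mktemp -d)
printf '%s\n' "$sample" | sort >"$workdir/sorted.tmp"
count_with_file=$(uniq <"$workdir/sorted.tmp" | wc -l)
rm -r "$workdir"

# With pipes only: each command reads the previous command's output.
count_with_pipe=$(printf '%s\n' "$sample" | sort | uniq | wc -l)

echo "temporary file: $count_with_file   pipe: $count_with_pipe"
```

Both approaches give the same count; the pipe version leaves nothing
behind to clean up.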

5.  Remove recursively any directory named "jobs6".

6.  Create a new directory named "jobs6" in the current directory.

7.  Change directories to make jobs6 the current directory.

9.  Show on the screen the full pathname of the current directory.

10. Copy into the current directory the file jobs.txt.bz2 from subdirectory
    cst8129 under the home directory of userid "alleni".  The checksum on
    this file is 55837.  Display the checksum of this file (the file in the
    current directory).

11. Decompress the jobs.txt.bz2 file.  The checksum on the decompressed
    jobs.txt file is 60053.  Display the checksum of this file.

12. Extract just the subject field from every line and put the output into
    a new file named "subjects.txt".  (See above for a description of the
    file format.)  The subjects.txt file will contain 19,974 lines and the
    checksum will be 64010.  Display the checksum of this file.
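
Selecting one tab-separated field can be tried out on a tiny sample
before running it on the real file.  The sketch below assumes two
made-up records with only three fields each (the real file has nine);
the record contents and variable names are hypothetical.

```shell
# Illustration only: two made-up records with three tab-separated fields.
# (The real jobs.txt has nine fields per line.)
tab=$(printf '\t')
line1="101${tab}Java Developer${tab}2005-10-01"
line2="102${tab}Unix Admin${tab}2005-10-02"

# cut splits on tabs by default; -f2 keeps only the second field.
second_fields=$(printf '%s\n%s\n' "$line1" "$line2" | cut -f2)
echo "$second_fields"
```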

13. Change all upper-case letters in the subjects.txt file to lower-case
    and put the result into file "tmp".  (Why can't you redirect the
    output directly back into the subjects.txt file?)  Rename "tmp" to be
    "subjects.txt".  The file should still contain 19,974 lines.  The new
    checksum will be 30988.  Display the checksum of this file.
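
The question about redirecting back into the same file can be explored
safely on a throwaway file.  The sketch below is one way to see what
happens; the scratch directory, the file name demo.txt, and the variable
names are all made up for the demonstration.

```shell
# Illustration only: demo.txt is a throwaway file in a scratch directory.
scratch=$(mktemp -d)
printf 'Hello World\n' >"$scratch/demo.txt"

# The shell opens (and empties) the output file BEFORE tr starts,
# so tr finds nothing left to read.
tr 'A-Z' 'a-z' <"$scratch/demo.txt" >"$scratch/demo.txt"
size_after_bad=$(wc -c <"$scratch/demo.txt")

# Safe pattern: write to a second file, then rename it over the first.
printf 'Hello World\n' >"$scratch/demo.txt"
tr 'A-Z' 'a-z' <"$scratch/demo.txt" >"$scratch/tmp"
mv "$scratch/tmp" "$scratch/demo.txt"
result=$(cat "$scratch/demo.txt")

echo "bytes left after redirecting onto the input: $size_after_bad"
echo "after tmp-and-rename: $result"
rm -r "$scratch"
```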

14. Remove all duplicate lines from the subjects.txt file and put the result
    into file "uniquesubjects.txt".  Display the word count of this file
    (it should be 12751 70388 601869 uniquesubjects.txt).

15. Translate anything in uniquesubjects.txt that is *NOT* a letter or digit
    into a newline character (which puts every word on a separate line),
    and generate a count of how many times each distinct line (each word)
    occurs, sorted with the most frequent at the top.  Put the output into file
    keywords.txt.  Display the word count (6376 12751 103041 keywords.txt)
    and checksum (42831) of the keywords.txt file.

    The script output should now look like this:

    55837   718
    60053  4534
    64010   899
    30988   899
     12751  70388 601869 uniquesubjects.txt
      6376  12751 103041 keywords.txt
    42831   101
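
The general word-counting idiom can be tried on a few lines of sample
text first.  The sketch below uses three invented subject lines and
hypothetical variable names; it shows one common way to chain the
commands and is not necessarily the exact pipeline that produces the
checksums above.

```shell
# Illustration only: three made-up subject lines, already lower-case.
subjects='java developer
senior java/unix developer
unix admin'

# tr -c replaces every character that is NOT a lower-case letter or digit
# with a newline; sort groups identical words; uniq -c counts each group;
# sort -nr puts the largest counts first.
counts=$(printf '%s\n' "$subjects" |
    tr -c 'a-z0-9' '\n' |
    sort |
    uniq -c |
    sort -nr)
echo "$counts"
```

In this sample no two separator characters are adjacent.  In real data,
a run of separators (such as ", ") produces empty lines, and those empty
lines are counted like any other line.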

16. Remove all files except the keywords.txt file.

17. Display the top 15 lines of the keywords.txt file on the screen.

Execute your file and make sure there are no errors:   ./exercise06script1.sh

Which keyword appears more often in the file, "cobol" or "graphics"?

---------------------------------------------
Part II Exercise Details (in the T127 Linux Lab)
---------------------------------------------

You will implement the PDL you wrote in Exercise #5.

1.  Create a file named "exercise06script2.sh" with these five lines in it:

    #!/bin/sh -u
    PATH=/bin:/usr/bin ; export PATH
    LC_COLLATE=C ; export LC_COLLATE
    LANG=C ; export LANG
    umask 022

    The lines must be at the left margin, with no leading or trailing blanks
    or blank lines.  The word count and checksum of the resulting file will be: 

    $ wc exercise06script2.sh
      5 16 110 exercise06script2.sh

    $ sum exercise06script2.sh 
    56814     1

2.  Make the file executable:   chmod +x exercise06script2.sh
    Make sure the file executes without errors:   ./exercise06script2.sh
    (There will be no output from the file yet.)

3.  Add your Assignment Label to the file as comment lines, below the
    /bin/sh line and above the PATH line.  Make sure the first line of 
    the script remains the shell interpreter line (as given above).

4.  Append your PDL from your previous exercise (Exercise #5) into the
    script file as comments.

Your file will now be in these sections, in this exact order:

    - shell interpreter line (comment)
    - Assignment Label (comments)
    - set PATH, LC_COLLATE, LANG, and umask
    - PDL from Exercise #5 (comments)

Execute the script file and make sure there are no errors and no output.
(You only added comment lines - the file should produce no output.)

5.  Underneath your PDL, implement your PDL using shell statements. 

    Instead of prompting and reading the three numbers, the first
    three statements of your algorithm will copy the command line
    arguments into three shell variables.  (See the script file
    commandline_arguments.sh.txt for a model and read shell_variables.txt
    for the details.)  Use something like this:

    <your var name1>="$1"
    <your var name2>="$2"
    <your var name3>="$3"

    Replace <your var name?> with your own variable names.  Shell
    variables have basically the same naming rules as C language variables
    - start with a letter, no spaces, etc.  No blanks are allowed around
    the "=" sign in shell assignments.

    The rest of your script will use the three variables you have defined.
    You may use other temporary variables if you wish.

    The last lines of your script will echo the sorted values to the
    screen, either one per line or all on one line (you choose).
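
As a minimal sketch of the building blocks (not a full solution), the
lines below copy two arguments into variables and swap them if they are
out of order.  The variable names "first", "second", and "tmp" are made
up, and "set --" is used here only to stand in for arguments typed on
the command line.  Extending the idea to three values is the exercise.

```shell
# Illustration only: "first" and "second" are made-up variable names, and
# "set --" stands in for arguments typed on the command line.
set -- 11 9

first="$1"
second="$2"

# -gt compares as integers (11 is greater than 9).  A string comparison
# would instead put "11" before "9", because "1" sorts before "9".
if [ "$first" -gt "$second" ]; then
    tmp="$first"
    first="$second"
    second="$tmp"
fi

echo "$first $second"
```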

6.  Testing:  You can test your sorting algorithm using different
    numbers given on the command line:

    $ ./exercise06script2.sh  1  3  2
    1 2 3

    $ ./exercise06script2.sh  200  300  100
    100 200 300

    $ ./exercise06script2.sh  11  1  111
    1 11 111

    etc.

Note that if you do not give your script any values to sort, the "-u"
option given to the shell on the first line will abort the script with
an "unbound variable" error.  You must always give the script three
arguments.  (More on dealing with missing arguments later.)
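
The effect of "-u" can be seen without touching your script by running a
one-line script in a child shell.  In the sketch below, the one-line
script, the placeholder name "demo", and the variable names are all
hypothetical.  The exact wording of the error message differs from
shell to shell, so the sketch checks only the exit status.

```shell
# Illustration only: run a one-line script in a child shell with -u.
# With no arguments, expanding "$1" is an error and the child shell stops.
if sh -u -c 'value="$1"; echo "got $value"' demo 2>/dev/null; then
    missing_status=ok
else
    missing_status=aborted
fi

# With an argument supplied, the same script runs normally.
# ("demo" fills $0, so 42 becomes $1.)
present_output=$(sh -u -c 'value="$1"; echo "got $value"' demo 42)

echo "no argument: $missing_status   with argument: $present_output"
```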

Submission
----------

Submit the finished and labelled files for marking using the following
Linux command line:

       $ ~alleni/bin/copy exercise06script1.sh exercise06script2.sh 

This program will copy the selected files to me for marking.  You can
copy the files more than once.  Only the most recent copies will be marked.
Always submit both files for marking at the same time.

This exercise is due at the end of your lab period today.

P.S.  Did you spell all the label fields and file names correctly?