-----------------------
Exercise #6 for CST8129 due October 19, 2005
-----------------------
-Ian! D. Allen - idallen@idallen.ca

Remember - knowing how to find out an answer is more important than
memorizing the answer.  Learn to fish!  RTFM!  (Read The Fine Manual)

Global weight: 3% of your total mark this term

Due date: Before the end of your Lab period on Wednesday, October 19.

The online deliverables for this exercise are to be submitted online via
the T127 Linux Lab using the submit method described in the exercise
description, below.  No paper; no email; no FTP.

Late-submission date: I will accept without penalty online exercises that
are submitted late but before 17h00 (5pm) on Thursday, October 20.
After that late-submission date, the exercise is worth zero marks.

Exercises submitted by the *due date* will be marked online and your marks
will be sent to you by email after the late-submission date.

This exercise is due before the end of your Lab period on October 19.

Exercise Synopsis:  Marks: 3%

    Write a shell script to data-mine job listings.
    Write a shell script to sort three integers.

Where to work:

    Do your Unix command line work on any WT127 workstation.  (You may
    login to the workstation remotely.)  The files you work on will remain
    in your account after you log off.

    Do not erase your files after submission; always keep a spare copy of
    your exercises.

    WARNING: Do not attempt this exercise on a Windows machine - the text
    file format is different.  You must connect to and work on Unix/Linux.
    Note that you may connect to a lab workstation *from* a Windows machine
    (using PuTTY); however, you may not use the Windows machine itself to
    do your work.  Use the vim editor on the Linux machine.

Location of the course notes on the Lab workstations:

    You can find a copy of all the course Notes files on any Lab
    workstation under directory:

        ~alleni/public_html/teaching/cst8129/05f/notes/

    You can copy files from this directory to your own account for
    modification or study, if you like.  (To avoid plagiarism charges, you
    must credit any material that you copy and submit unchanged as your
    own work.)

Location of the textbook CDROM files on the Lab workstations:

    The CDROM files for the Quigley textbook are available in the WT127
    Lab under the directory:

        /home/cst8129/

Exercise Preparation:

 A. Know where to find an online copy of all the course Notes on the Lab
    workstations.  (See above.)  You can get a copy of this exercise from
    the course notes.

 B. Complete the online Course Notes readings.

Any questions?  See me in a lab or post questions to the Discussion news
group (on the top left of the Course Home Page).

---------------------------------------------
Part I Exercise Details (in the T127 Linux Lab)
---------------------------------------------

The USENET news group "ott.jobs" contains tens of thousands of local job
listings.  Today (Oct 19), I obtained a summary of these postings, with
one line per posting.  Your job is to answer the question:

    What words are most used in the "Subject" lines of these job postings?

Presumably, the most used keywords are the most-needed skills that you
should have.

File Format
-----------

The summary file I obtained contains 9 tab-separated fields.  The
"Subject" field is the second field in each line.  The file is compressed.

In the steps below, you will write a script to generate a sorted list of
keywords from the subject lines, with the most used words at the top and
the least used words at the bottom.

1.  Create a file named "exercise06script1.sh" with these five lines in it:

        #!/bin/sh -u
        PATH=/bin:/usr/bin ; export PATH
        LC_COLLATE=C ; export LC_COLLATE
        LANG=C ; export LANG
        umask 022

    The lines must be at the left margin, with no leading or trailing
    blanks or blank lines.  The word count and checksum of the resulting
    file will be:

        $ wc exercise06script1.sh
          5  16 110 exercise06script1.sh
        $ sum exercise06script1.sh
        56814     1

2.  Make the file executable:

        chmod +x exercise06script1.sh

    Make sure the file executes without errors:

        ./exercise06script1.sh

    (There will be no output from the file yet.)

3.  Add your Assignment Label to the file as comment lines, below the
    /bin/sh line and above the PATH line.  Make sure the first line of the
    script remains the shell interpreter line (as given above).

4.  Copy the block of questions below into the end of the script file and
    add octothorpe comment characters ("#") in front of all the lines.
    Your file will now be in these sections, in this exact order:

        - shell interpreter line (comment)
        - Assignment Label (comments)
        - set PATH, LC_COLLATE, LANG, and umask
        - Questions 5 and up (comments)

    Execute the script file and make sure there are no errors and no
    output.  (You only added comment lines - the file should produce no
    output.)

Under each numbered question below, add commands, one by one, that will
do the steps below, in order.  You must make sure each command works at
the command line before you copy it into the script file and then test it
by executing the file.  (Hint: Work with two or three shell windows open.)

Do not create any extra temporary files (other than the files explicitly
named in this exercise).  Use pipes to connect commands, not files.

5.  Remove recursively any directory named "jobs6".

6.  Create a new directory named "jobs6" in the current directory.

7.  Change directories to make jobs6 the current directory.

9.  Show on the screen the full pathname of the current directory.

10. Copy into the current directory the file jobs.txt.bz2 from
    subdirectory cst8129 under the home directory of userid "alleni".
    The checksum on this file is 55837.
    Display the checksum of this file (the file in the current directory).

11. Decompress the jobs.txt.bz2 file.
    The checksum on the decompressed jobs.txt file is 60053.
    Display the checksum of this file.

12. Extract just the subject field from every line and put the output
    into a new file named "subjects.txt".  (See above for a description
    of the file format.)  The subjects.txt file will contain 19,974 lines
    and the checksum will be 64010.
    Display the checksum of this file.

13. Change all upper-case letters in the subjects.txt file to lower-case
    and put the result into file "tmp".  (Why can't you redirect the
    output directly back into the subjects.txt file?)
    Rename "tmp" to be "subjects.txt".  The file should still contain
    19,974 lines.  The new checksum will be 30988.
    Display the checksum of this file.

14. Remove all duplicate lines from the subjects.txt file and put the
    result into file "uniquesubjects.txt".
    Display the word count of this file
    (it should be 12751 70388 601869 uniquesubjects.txt).

15. Translate anything in uniquesubjects.txt that is *NOT* a letter or
    digit into a newline character (which puts every word on a separate
    line), and generate a list of the unique counts of each line (each
    word), sorted with the most frequent line count at the top.
    Put the output into file keywords.txt.
    Display the word count (6376 12751 103041 keywords.txt) and checksum
    (42831) of the keywords.txt file.

    The script output should now look like this:

        55837   718
        60053  4534
        64010   899
        30988   899
        12751 70388 601869 uniquesubjects.txt
         6376 12751 103041 keywords.txt
        42831   101

16. Remove all files except the keywords.txt file.

17. Display the top 15 lines of the keywords.txt file on the screen.
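
The word-counting idea behind steps 13 to 15 can be tried on a toy input
first.  The sketch below is only an illustration, not the required answer:
the function name count_words and the three sample subject lines are made
up, and it squeezes runs of separators with "tr -s", which the steps above
do not ask for, so its counts will not reproduce the checksums given in
this exercise.

```shell
#!/bin/sh -u
# count_words (hypothetical name): read text on standard input and
# print each distinct word with its count, most frequent first.
count_words () {
    tr 'A-Z' 'a-z' |           # fold upper-case letters to lower-case
    tr -cs 'a-z0-9' '\n' |     # each run of non-alphanumerics -> one newline
    sort |                     # uniq only merges adjacent identical lines
    uniq -c |                  # prefix every distinct word with its count
    sort -nr                   # numeric sort, largest count at the top
}

# Three made-up subject lines (not taken from jobs.txt).
# "java" and "developer" each appear twice; every other word once.
printf '%s\n' 'Java Developer' 'Senior Java/C++ Developer' 'COBOL Analyst' |
    count_words |
    head -3
```

Everything is connected with pipes, so no temporary file is needed.  The
"sort" before "uniq -c" matters: without it, repeated words that are not
next to each other would be counted separately.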

Execute your file and make sure there are no errors:

    ./exercise06script1.sh

Which keyword appears more often in the file, "cobol" or "graphics"?

---------------------------------------------
Part II Exercise Details (in the T127 Linux Lab)
---------------------------------------------

You will implement the PDL you wrote in Exercise #5.

1.  Create a file named "exercise06script2.sh" with these five lines in it:

        #!/bin/sh -u
        PATH=/bin:/usr/bin ; export PATH
        LC_COLLATE=C ; export LC_COLLATE
        LANG=C ; export LANG
        umask 022

    The lines must be at the left margin, with no leading or trailing
    blanks or blank lines.  The word count and checksum of the resulting
    file will be:

        $ wc exercise06script2.sh
          5  16 110 exercise06script2.sh
        $ sum exercise06script2.sh
        56814     1

2.  Make the file executable:

        chmod +x exercise06script2.sh

    Make sure the file executes without errors:

        ./exercise06script2.sh

    (There will be no output from the file yet.)

3.  Add your Assignment Label to the file as comment lines, below the
    /bin/sh line and above the PATH line.  Make sure the first line of the
    script remains the shell interpreter line (as given above).

4.  Append your PDL from your previous exercise (Exercise #5) into the
    script file as comments.
    Your file will now be in these sections, in this exact order:

        - shell interpreter line (comment)
        - Assignment Label (comments)
        - set PATH, LC_COLLATE, LANG, and umask
        - PDL from Exercise #5 (comments)

    Execute the script file and make sure there are no errors and no
    output.  (You only added comment lines - the file should produce no
    output.)

5.  Underneath your PDL, implement your PDL using shell statements.

    Instead of prompting and reading the three numbers, the first three
    statements of your algorithm will copy the command line arguments
    into three shell variables.  (See the script file
    commandline_arguments.sh.txt for a model and read shell_variables.txt
    for the details.)  Use something like this:

        NAME1="$1"
        NAME2="$2"
        NAME3="$3"

    Replace the placeholders NAME1, NAME2, and NAME3 with your own
    variable names.  Shell variables have basically the same naming rules
    as C language variables - start with a letter, no spaces, etc.
    No blanks are allowed around the "=" sign in shell assignments.

    The rest of your script will use the three variables you have defined.
    You may use other temporary variables if you wish.

    The last lines of your script will echo the sorted values to the
    screen, either one per line or all on one line (you choose).

6.  Testing: You can test your sorting algorithm using different numbers
    given on the command line:

        $ ./exercise06script2.sh 1 3 2
        1 2 3
        $ ./exercise06script2.sh 200 300 100
        100 200 300
        $ ./exercise06script2.sh 11 1 111
        1 11 111
        etc.

    Note that if you do not give your script any values to sort, the "-u"
    option given to the shell on the first line will abort the script
    with an "unbound variable" error.  You must always give the script
    three arguments.  (More on dealing with missing arguments later.)

Submission
----------

Submit the finished and labelled files for marking using the following
Linux command line:

    $ ~alleni/bin/copy exercise06script1.sh exercise06script2.sh

This program will copy the selected files to me for marking.
You can copy the files more than once.  Only the most recent copies will
be marked.  Always submit both files for marking at the same time.

This exercise is due at the end of your lab period today.

P.S. Did you spell all the label fields and file names correctly?
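
For Part II, step 5, the building blocks (copying arguments into
variables, then comparing and swapping) might look like the sketch below.
It handles only two values, not three, so it is not a solution to the
exercise.  The names sort_two, first, second, and tmp are hypothetical.

```shell
#!/bin/sh -u
# sort_two (hypothetical helper): echo two integers in ascending order.
sort_two () {
    first="$1"      # no blanks around the "=" sign
    second="$2"
    # -gt compares the values as integers, not as strings.
    if [ "$first" -gt "$second" ] ; then
        tmp="$first"        # swap the two values using a temporary
        first="$second"
        second="$tmp"
    fi
    echo "$first $second"
}

sort_two 200 100    # prints: 100 200
sort_two 11 111     # prints: 11 111
```

The same compare-and-swap step, repeated over the right pairs of
variables, is one way to extend this to three values.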