% Useful programs: gzip,bzip2 - compression; tar,zip - file archiving; diff - comparison
% Ian! D. Allen -- [www.idallen.com]
% Winter 2015 - January to April 2015 - Updated 2019-03-08 04:21 EST

- [Course Home Page]
- [Course Outline]
- [All Weeks]
- [Plain Text]

File compression: `gzip` and `gunzip`
=====================================

The `gzip` command compresses a file, replacing it with a new binary
compressed file that has a `.gz` suffix added to the name:

    $ cp -p /etc/passwd foo
    $ gzip foo
    $ ls -ls /etc/passwd foo.gz
    96 -rw-r--r-- 1 root    root    97450 Feb 10 13:08 /etc/passwd
    28 -rw-r--r-- 1 idallen idallen 26884 Feb 10 13:08 foo.gz
    $ file foo.gz
    foo.gz: gzip compressed data, was "foo", from Unix, last modified: Wed Feb 10 13:08:27 2016

The original file is removed after being compressed.  The compressed file
keeps the modify time of the original file.

You can decompress/uncompress the file with `gunzip`, which restores the
original file contents and removes the suffix from the name:

    $ gunzip foo.gz               # "gunzip foo" works too
    $ ls -ls foo
    96 -rw-r--r-- 1 idallen idallen 97450 Feb 10 13:08 foo

The compressed file is removed after being uncompressed.  The modify time
of the file is preserved.

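
The round trip described above can be checked with a short script.  This is
only a sketch: the scratch directory, the file names `foo` and `foo.orig`,
and the use of `seq` to make sample data are assumptions made for
illustration, not part of the examples above.

```shell
#!/bin/sh
# Sketch: compress and uncompress a scratch file and confirm nothing changed.
set -eu
work=$(mktemp -d)                     # scratch directory (hypothetical)
trap 'rm -rf "$work"' EXIT            # clean up when the script exits

seq 1 1000 >"$work/foo"               # sample data standing in for a real file
cp -p "$work/foo" "$work/foo.orig"    # keep a copy to compare against

gzip "$work/foo"                      # creates foo.gz and removes foo
ls -l "$work"                         # shows foo.gz and foo.orig only

gunzip "$work/foo.gz"                 # recreates foo and removes foo.gz
cmp "$work/foo" "$work/foo.orig" && echo "round trip OK"
```

The `cmp` command prints nothing and succeeds when the two files are
identical, so the final message appears only if the uncompressed file matches
the saved copy.
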
The `gunzip` command will not uncompress a file by name unless the file
name ends in the `.gz` suffix:

    $ gzip </etc/passwd >foo
    $ file foo
    foo: gzip compressed data, last modified: Wed Mar  6 21:13:03 2019, from Unix
    $ gunzip foo
    gzip: foo: unknown suffix -- ignored
    $ mv foo foo.gz
    $ gunzip foo.gz
    $ ls -l /etc/passwd foo
    -rw-r--r-- 1 root    root    168835 Mar  6 16:13 /etc/passwd
    -rw-rw-r-- 1 idallen idallen 168835 Mar  8 04:03 foo

Using filters (no file names)
-----------------------------

You can use either command as a filter (reading standard input and writing
standard output) if you don't give it a file name:

    $ fgrep 'refused connect' /var/log/auth.log | gzip >bad.txt.gz

The `bunzip2` command is more forgiving than `gunzip` about a missing
suffix.  Here `foo` is a `bzip2`-compressed file whose name lacks the `.bz2`
suffix; `bunzip2` uncompresses it anyway and adds `.out` to the output name:

    $ file foo
    foo: bzip2 compressed data, block size = 900k
    $ bunzip2 foo
    bunzip2: Can't guess original name for foo -- using foo.out
    $ ls -l /etc/passwd foo.out
    -rw-r--r-- 1 root    root    168835 Mar  6 16:13 /etc/passwd
    -rw-rw-r-- 1 idallen idallen 168835 Mar  8 04:05 foo.out

Helpers: `bzless bzfgrep bzcat bzdiff bzgrep`
---------------------------------------------

Some helpful `bz`-commands have been created to directly access compressed
files and save typing `bunzip2` in a pipe all the time:
`bzcat bzdiff bzfgrep bzgrep bzless`.

TAR archives: `tar`
===================

> Read the mouse-over text in the [`tar`-related comic]
> from the [XKCD] webcomic.

Long before software package managers such as YUM, RPM, and APT, there were
`tar` archives.  Originally written as a magnetic Tape ARchiver, the command
is common to every Unix/Linux system.

A `tar` archive file is the Unix version of a `zip` file.  It is one file
that contains many other files inside it.  You can download and extract a
`tar` format archive file on almost any Unix/Linux system dating back to
1979.

![[Unix tar license plate][Disarm the bomb with a Unix tar command line]][2]

A `tar` archive, also called a "*tarball*", is a single file that contains
multiple uncompressed files and directories.  Unix/Linux software source is
often distributed as a "tarball".

The syntax of the `tar` command is irregular -- you don't have to put
dashes in front of the operation letters (but you can if you like):

    Syntax: tar [options] -f archive [pathnames]

    $ tar cf /tmp/my.tar .        # create archive of current directory
    $ tar -cf stuff.tar *.c       # archive all the .c files
    $ tar -xvf my.tar             # extract everything into current dir
    $ tar xvf my.tar mydir        # only extract mydir from the archive

The name of the `tar` archive can be anything; the suffixes are there simply
for human readers to better know what the files contain.

The archive name must always directly follow the `-f` option with no other
option letters in between:

    $ tar -tvf my.tar             # correct use of -f
    $ tar -vft my.tar             # WRONG use of -f
    $ tar -fvt my.tar             # WRONG use of -f

You must always use one of three major operation letters:

    -t: list the pathnames in the archive (a table of contents)
    -x: extract (all or some) pathnames from the archive
    -c: create a new tar archive (erases existing contents!)

You may optionally use some other relevant options:

    -f: select the archive pathname (almost always used; must be last option)
    -p: preserve permissions when extracting
    -v: verbose (more messages about what is happening, or more detail)
    -z: the entire archive is gzip compressed (or uncompressed if extracting)
    -j: the entire archive is bzip2 compressed (or uncompressed if extracting)

The `-f` archive pathname option is almost *always* used, unless you happen
to own a tape drive!  Always use `-f` and an archive file name.  The archive
file name *must immediately follow* the `-f` option with no other option
letters in between, e.g. `tar -tvf my.tar`

The `-v` "verbose" option above lists all the file names as they are put
into an archive file, or as they are extracted.  This is useful for
debugging, but isn't usually used for a production system where you know
exactly what is going into the archive; leave it out for normal use.

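
The three operation letters can be tried out safely on a small scratch tree.
The script below is a minimal sketch; the names `src`, `dest`, `main.c`,
`mydir`, and `notes.txt` are invented for illustration, and everything
happens inside a temporary directory.

```shell
#!/bin/sh
# Sketch: create (-c), list (-t), and partially extract (-x) a tar archive.
set -eu
work=$(mktemp -d)                     # scratch directory (hypothetical)
trap 'rm -rf "$work"' EXIT            # clean up when the script exits

mkdir -p "$work/src/mydir" "$work/dest"
echo 'int main(void) { return 0; }' >"$work/src/main.c"
echo 'notes' >"$work/src/mydir/notes.txt"

cd "$work/src"
tar -cf "$work/my.tar" main.c mydir   # -c: create; archive name directly follows -f
tar -tf "$work/my.tar"                # -t: table of contents

cd "$work/dest"
tar -xf "$work/my.tar" mydir          # -x: extract only mydir and its contents
ls -R                                 # main.c is not extracted
```

The archive is written outside the directory being archived, and the
pathname `mydir` given when extracting matches the name that was stored when
the archive was created.
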
If an uncompressed *tarball* file is damaged, the damage may affect only
some of the files in the *tarball*; the other files, even files stored after
the damage point, may still be recoverable.

Compressed tarballs: `tarball.tar.gz` and `tarball.tar.bz2`
-----------------------------------------------------------

A compressed *tarball* is simply a single *tarball* file that has been
compressed with either `gzip` or `bzip2`.  The compression applies to the
entire *tarball*, not to the individual files inside the *tarball*.

A *tarball* file may be first created and then compressed *as a whole* using
either the `gzip` or `bzip2` file compression commands:

    $ tar -cf tarball.tar *.c     # create archive named tarball.tar
    $ gzip tarball.tar            # compress into tarball.tar.gz

    $ tar -cf tarball.tar *.c     # create archive named tarball.tar
    $ bzip2 tarball.tar           # compress into tarball.tar.bz2

Modern versions of `tar` have option letters that do this compression for
you (less typing), creating and compressing the archive in one step:

    $ tar -czf tarball.tar.gz *.c      # create and gzip compress into tarball.tar.gz
    $ tar -cjf tarball.tar.bz2 *.c     # create and bzip2 compress into tarball.tar.bz2

You generate a table of contents, or extract all the files, using the
appropriate de-compression option depending on whether and how the *tarball*
file was compressed:

    $ tar -tf  tarball.tar             # table of contents if uncompressed
    $ tar -tzf tarball.tar.gz          # table of contents if gzip compressed
    $ tar -tjf tarball.tar.bz2         # table of contents if bzip2 compressed

    $ tar -xf  tarball.tar             # extract contents (uncompressed)
    $ tar -xzf tarball.tar.gz          # extract contents (gzip compressed)
    $ tar -xjf tarball.tar.bz2         # extract contents (bzip2 compressed)

The `tar` command doesn't care what you name your archive file.
The `gzip` compressed *tarballs* usually have names ending with `*.tar.gz`
or `*.tgz` and `bzip2` compressed *tarballs* usually have names ending with
`*.tar.bz2` or `*.tb2`.

> Modern versions of the `tar` command automatically recognize existing
> compressed archives and thus don't require the extra `z` or `j` option
> letters to read compressed archives.  You still need the appropriate letter
> to create a new compressed archive file.

If a compressed *tarball* file is damaged, **all** the files following the
damage point cannot be decompressed and are usually unrecoverable.

Using `tar` to archive or restore a directory
---------------------------------------------

The `tar` command will automatically recursively archive entire directories
into a tarball if you give it directories.  Software is often distributed as
a tarball file.

    $ cd                                # go to my home directory
    $ tar czf /tmp/homedir.tar.gz .     # archive current directory into a file

*Do not place the output tarball file in any of the directories being used
as input to `tar`!*

When you have a tarball, you can then extract it into the current directory:

    $ mkdir /some/backupdir
    $ cd /some/backupdir
    $ tar xzpf /tmp/homedir.tar.gz      # extract the whole archive into current directory

The `p` option preserves the modes (permissions) of the files as they are
extracted.

### Legacy: Using `tar` to copy a directory

*This legacy use of `tar` to copy an entire directory has been largely
replaced by the `-a` (archive) option to `cp` or by the `rsync` command.*

You can do a directory copy with `tar` using a pipe instead of an output
file by using the special file name `-` that stands for either standard
output (when creating) or standard input (when extracting):

    $ cd
    $ tar cf - . | ( cd /some/backupdir && tar xpf - )                 # local copy
    $ tar cf - . | ( ssh otherhost 'cd /some/dir && tar xpf -' )       # remote host copy

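
Both techniques above, archiving to a compressed tarball and copying through
a pipe, can be tried on a throwaway tree and checked with `diff -r`.  This is
a sketch under stated assumptions: the directory names `home`, `backup`, and
`copy` and the sample files are hypothetical, and only the local form of the
pipe copy is shown.

```shell
#!/bin/sh
# Sketch: back up a directory to a gzip-compressed tarball, restore it
# elsewhere, and also copy it directly through a pipe.
set -eu
work=$(mktemp -d)                     # scratch directory (hypothetical)
trap 'rm -rf "$work"' EXIT            # clean up when the script exits

mkdir -p "$work/home/sub" "$work/backup" "$work/copy"
echo 'hello' >"$work/home/a.txt"
echo 'world' >"$work/home/sub/b.txt"
chmod 600 "$work/home/a.txt"

# Archive: the tarball is written outside the directory being archived.
( cd "$work/home" && tar -czf "$work/homedir.tar.gz" . )

# Restore into another directory; -p preserves the permissions.
( cd "$work/backup" && tar -xzpf "$work/homedir.tar.gz" )

# Legacy pipe copy: "-" is standard output when creating and
# standard input when extracting.
( cd "$work/home" && tar -cf - . ) | ( cd "$work/copy" && tar -xpf - )

diff -r "$work/home" "$work/backup" && echo "backup matches"
diff -r "$work/home" "$work/copy" && echo "copy matches"
```

The `diff -r` command compares the two directory trees recursively and
succeeds only when every file has the same contents in both.
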
ZIP archives: `zip` and `unzip`
===============================

A ZIP file is a single file containing individually compressed files.  (This
is not the same format as a compressed *tarball*, which is a single
compressed file containing individual uncompressed files.)

Unix/Linux can also manipulate ZIP format file archives (often used on
Microsoft systems) using `zip` and `unzip`:

    $ touch file1 file2 file3
    $ zip foo file1 file2 file3         # create foo.zip with three files
      adding: file1 (stored 0%)
      adding: file2 (stored 0%)
      adding: file3 (stored 0%)
    $ ls -l foo.zip
    -rw-rw-r-- 1 idallen idallen 436 Mar  9 03:44 foo.zip

    $ unzip -l foo.zip                  # list the contents (do not extract)
    Archive:  foo.zip
      Length      Date    Time    Name
    ---------  ---------- -----   ----
            0  2016-03-09 03:44   file1
            0  2016-03-09 03:44   file2
            0  2016-03-09 03:44   file3
    ---------                     -------
            0                     3 files

    $ rm file?
    $ unzip foo.zip                     # extract all the files
    Archive:  foo.zip
     extracting: file1
     extracting: file2
     extracting: file3

Other options can preserve directory hierarchy and do other things.  See the
man page.

If a ZIP file is damaged, the damage usually affects only some of the files
in the ZIP file and the other files, even files stored after the damage
point, may still be recoverable.

Differences between ZIP and TAR
===============================

- Q: Is a *tarball* an archive of separate, individually compressed files
  (which is also the structure of a `zip` file), or does `tar` archive
  together all the files first (uncompressed) and then compress the whole
  archive?
- Q: Which is more space-efficient: a `zip` file or a compressed `tar` file,
  and why?  (Hint: Consider archiving 1000 copies of the same file.)
- Q: Which is less affected by file damage: a `zip` file or a compressed
  `tar` file, and why?

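
One way to explore the hint about many copies of the same file is to run a
small experiment and compare the two numbers it prints.  The script below is
only a sketch: it uses 100 copies instead of 1000 to keep it quick, the file
names are made up, and it compresses each file separately with `gzip` as a
rough stand-in for the per-file compression of a ZIP archive (a real `zip`
file adds its own headers, so its exact size would differ).

```shell
#!/bin/sh
# Sketch: compare whole-archive compression with per-file compression.
set -eu
work=$(mktemp -d)                     # scratch directory (hypothetical)
trap 'rm -rf "$work"' EXIT            # clean up when the script exits

mkdir "$work/files"
seq 1 200 >"$work/sample"             # one small sample file
i=1
while [ "$i" -le 100 ]; do            # make 100 identical copies
    cp "$work/sample" "$work/files/copy$i"
    i=$((i + 1))
done

# Whole-archive compression, as in a compressed tarball.
( cd "$work/files" && tar -czf "$work/all.tar.gz" . )
whole=$(wc -c <"$work/all.tar.gz" | tr -d ' ')

# Per-file compression, standing in for what a ZIP archive does.
gzip "$work"/files/copy*
each=$(cat "$work"/files/copy*.gz | wc -c | tr -d ' ')

echo "whole-archive bytes:  $whole"
echo "per-file total bytes: $each"
```

Changing the number of copies or the size of the sample file and running the
script again shows how the two approaches respond to repeated data.
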
Differences between text files: `diff`
======================================

The `diff` command compares two files: `diff file1 file2`

- See Also: `vimdiff` and `gvimdiff`
- See Also: `diff3`
- For systems running X Windows, see also `meld`

Handling Unix/Linux archives and compressed files under Microsoft Windows
=========================================================================

Student Tammy Rediger (17F) tells me that "the program **7zip** does work
with `.gz`, `.bzip2` and `.tar` files" under Microsoft Windows.

-- 
| Ian! D. Allen, BA, MMath  -  idallen@idallen.ca  -  Ottawa, Ontario, Canada
| Home Page: http://idallen.com/   Contact Improv: http://contactimprov.ca/
| College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/
| Defend digital freedom:  http://eff.org/  and have fun:  http://fools.ca/

[Plain Text] - plain text version of this page in [Pandoc Markdown] format

  [www.idallen.com]: http://www.idallen.com/
  [Course Home Page]: ..
  [Course Outline]: course_outline.pdf
  [All Weeks]: indexcgi.cgi
  [Plain Text]: 520_package_management.txt
  [Disarm the bomb with a Unix tar command line]: http://xkcd.com/1168/
  [1]: http://imgs.xkcd.com/comics/tar.png "Disarm the bomb with a Unix tar command line"
  [`tar`-related comic]: http://xkcd.com/1168/
  [XKCD]: http://xkcd.com/
  [2]: data/tar_xvf_small.jpg "Unix tar license plate"
  [Pandoc Markdown]: http://johnmacfarlane.net/pandoc/