Useful programs: gzip, bzip2 - compression; tar, zip - file archiving; diff - comparison

Ian! D. Allen – www.idallen.com

Winter 2018 - January to April 2018 - Updated 2019-03-08 04:21 EST

1 File compression: gzip and gunzip

You can compress a file using the gzip command, and the result is a new binary compressed file with a .gz suffix added on the end:

$ cp -p /etc/passwd foo
$ gzip foo
$ ls -ls /etc/passwd foo.gz 
96 -rw-r--r-- 1 root     root     97450 Feb 10 13:08 /etc/passwd
28 -rw-r--r-- 1 idallen  idallen  26884 Feb 10 13:08 foo.gz
$ file foo.gz
foo.gz: gzip compressed data, was "foo", from Unix, last modified: Wed Feb 10 13:08:27 2016

The original file is removed after being compressed. The modify time of the original file is preserved.

You can decompress/uncompress the file with gunzip, which restores the original file contents and removes the suffix from the name:

$ gunzip foo.gz                       # "gunzip foo" works too
$ ls -ls foo
96 -rw-r--r-- 1 idallen idallen 97450 Feb 10 13:08 foo

The compressed file is removed after being uncompressed. The modify time of the file is preserved.
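The round trip described above can be checked in a throwaway directory. This is only a minimal sketch: the names workdir, sample.txt, and before.txt are made up for illustration, and it assumes the mktemp and cmp commands are available.

```shell
# Work in a scratch directory so nothing important is touched.
workdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' >"$workdir/sample.txt"
cp -p "$workdir/sample.txt" "$workdir/before.txt"   # reference copy for comparison

gzip "$workdir/sample.txt"          # replaces sample.txt with sample.txt.gz
ls "$workdir"                       # shows before.txt and sample.txt.gz only

gunzip "$workdir/sample.txt.gz"     # replaces sample.txt.gz with sample.txt

# cmp is silent and returns 0 when two files are byte-for-byte identical.
if cmp "$workdir/sample.txt" "$workdir/before.txt"; then
    roundtrip=identical
else
    roundtrip=different
fi
echo "round trip: $roundtrip"
rm -r "$workdir"
```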

The gunzip command will not uncompress a file by name unless the file name ends in a suffix it recognizes, normally .gz:

$ gzip </etc/passwd >foo
$ file foo
foo: gzip compressed data, last modified: Wed Mar  6 21:13:03 2019, from Unix
$ gunzip foo
gzip: foo: unknown suffix -- ignored
$ mv foo foo.gz
$ gunzip foo.gz
$ ls -l /etc/passwd foo
-rw-r--r-- 1 root     root     168835 Mar  6 16:13 /etc/passwd
-rw-rw-r-- 1 idallen  idallen  168835 Mar  8 04:03 foo

1.1 Using filters (no file names)

You can use either command as a filter (reading standard input and writing standard output) if you don’t give it a file name:

$ fgrep 'refused connect' /var/log/auth.log | gzip >bad.txt.gz
$ gunzip <bad.txt.gz | wc
$ gunzip <bad.txt.gz | less

When used as a filter (no file name), the commands cannot actually compress or decompress the original file and remove it because there is no file name. Filter commands simply compress or decompress the data in the input stream; the file is not changed.
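To see that a filter leaves the input file alone, you can try something like the following. It is a sketch under assumptions: the scratch directory and the file name log.txt are invented for the example and are not from the commands above.

```shell
workdir=$(mktemp -d)
printf 'refused connect from 10.0.0.1\nall is well\n' >"$workdir/log.txt"

# No file name is given to gzip, so it only sees a stream on standard input.
gzip <"$workdir/log.txt" >"$workdir/log.txt.gz"

# The input file still exists; only a new .gz file was written by the shell.
if [ -f "$workdir/log.txt" ]; then
    original=kept
else
    original=removed
fi

# Decompress the stream and count its lines without creating any new file.
lines=$(gunzip <"$workdir/log.txt.gz" | wc -l)
echo "original file: $original, lines in compressed copy: $lines"
rm -r "$workdir"
```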

1.2 Helpers: zless zfgrep zcat zdiff zgrep

Some helpful z-commands have been created to directly access compressed files and save typing gunzip in a pipe all the time:

$ gunzip <bad.txt.gz | less           # hard way to paginate contents
$ zless bad.txt.gz                    # easy way
$ gunzip <bad.txt.gz | fgrep '.cn'    # hard way to fgrep contents
$ zfgrep '.cn' bad.txt.gz             # easy way

Since all of the z-commands are filters (they are small shell scripts), none of the z-commands affect the given file. The file is not decompressed and then removed. Only the file contents are decompressed and sent to standard output.

See also: zcat zdiff zgrep
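Two of those other helpers might be used as in the sketch below. It assumes the helper scripts that ship with GNU gzip (on some systems zcat only accepts .Z files), and the sample log lines are invented for illustration.

```shell
workdir=$(mktemp -d)
printf 'host1.example.cn refused\nhost2.example.org accepted\nhost3.example.cn refused\n' \
    | gzip >"$workdir/bad.txt.gz"

# zcat writes the decompressed contents to standard output.
zcat "$workdir/bad.txt.gz"

# zgrep takes the usual grep options; -c counts the matching lines.
matches=$(zgrep -c 'refused' "$workdir/bad.txt.gz")
echo "refused lines: $matches"

# The compressed file is still there afterwards; the helpers only read it.
ls "$workdir"
rm -r "$workdir"
```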

2 File compression: bzip2 and bunzip2

The commands bzip2 and bunzip2 are similar to gzip and gunzip but they use a different, often better, compression algorithm. The default file extension is .bz2 instead of .gz:

$ cp /etc/passwd foo
$ bzip2 foo
$ ls -ls /etc/passwd foo.bz2 foo.gz
96 -rw-r--r-- 1 root     root     97450 Feb 10 13:08 /etc/passwd
24 -rw-r--r-- 1 idallen  idallen  22235 Feb 10 13:08 foo.bz2
28 -rw-r--r-- 1 idallen  idallen  26884 Feb 10 13:08 foo.gz
$ file foo.bz2
foo.bz2: bzip2 compressed data, block size = 900k

As with gzip, the original file is removed after being compressed, unless the command is used as a filter (without a file name). The modify time of the original file is preserved.
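You can compare the two compressors on your own data with a sketch like this one. The generated sample file is an invented stand-in for a password file, and the script only reports a bzip2 size if the bzip2 command happens to be installed. How the two sizes compare depends on the input, so no result is promised here.

```shell
workdir=$(mktemp -d)

# Build a repetitive sample file of 2000 passwd-like lines.
i=0
while [ "$i" -lt 2000 ]; do
    echo "user$i:x:$i:$i:Sample User $i:/home/user$i:/bin/sh"
    i=$((i + 1))
done >"$workdir/sample.txt"

orig_size=$(wc -c <"$workdir/sample.txt")

# -c writes the compressed data to standard output and keeps the input file.
gz_size=$(gzip -c "$workdir/sample.txt" | wc -c)
if command -v bzip2 >/dev/null 2>&1; then
    bz_size=$(bzip2 -c "$workdir/sample.txt" | wc -c)
else
    bz_size="(bzip2 not installed)"
fi
echo "original: $orig_size  gzip: $gz_size  bzip2: $bz_size"
rm -r "$workdir"
```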

If you give bunzip2 a file name that does not end in .bz2, it decompresses the file into the same file name with .out appended:

$ bzip2 </etc/passwd >foo
$ file foo
foo: bzip2 compressed data, block size = 900k
$ bunzip2 foo
bunzip2: Can't guess original name for foo -- using foo.out
$ ls -l /etc/passwd foo.out
-rw-r--r-- 1 root     root     168835 Mar  6 16:13 /etc/passwd
-rw-rw-r-- 1 idallen  idallen  168835 Mar  8 04:05 foo.out

2.1 Helpers: bzless bzfgrep bzcat bzdiff bzgrep

Some helpful bz-commands have been created to directly access compressed files and save typing bunzip2 in a pipe all the time: bzcat bzdiff bzfgrep bzgrep bzless:

$ bunzip2 <bad.txt.bz2 | less         # hard way to paginate contents
$ bzless bad.txt.bz2                  # easy way

These helpers have similar names and work the same way as the gzip helper z-commands. See the man pages for the other helpers.

3 Unix/Linux tar file (tarball)

[Image: XKCD webcomic, "Disarm the bomb with a Unix tar command line"]

Read the mouse-over text in the above tar-related comic from the XKCD webcomic.

Long before software package managers such as YUM, RPM, and APT, there were tar archives. Originally written as a magnetic Tape ARchiver, the command is common to every Unix/Linux system. A tar archive file is the Unix version of a zip file: it is one file that contains many other files inside it. You can download and extract a tar format archive file on almost any Unix/Linux system made since tar first appeared in the late 1970s.

A tar archive, also called a “tarball”, is a single file that contains multiple uncompressed files and directories. Unix/Linux software source is often distributed as a “tarball”.

The syntax of the tar command is irregular – you don’t have to put dashes in front of the operation letters (but you can if you like):

Syntax: tar <operation> [options] -f <archive_file> [<pathnames>]
$ tar cf /tmp/my.tar .                # create archive of current directory
$ tar -cf stuff.tar *.c               # archive all the .c files
$ tar -xvf my.tar                     # extract everything into current dir
$ tar xvf my.tar mydir                # only extract mydir from the archive

The name of the tar archive can be anything; the suffixes are there simply for human readers to better know what the files contain.

The archive name must always directly follow the -f option with no other option letters in between:

$ tar -tvf my.tar                      # correct use of -f
$ tar -vft my.tar                      # WRONG use of -f
$ tar -fvt my.tar                      # WRONG use of -f

You must always use one of three major operation letters:

-t: list the pathnames in the archive (a table of contents)
-x: extract (all or some) pathnames from the archive
-c: create a new tar archive (erases existing contents!)

You may optionally use some other relevant options:

-f: select the archive pathname (almost always used; must be last option)
-p: preserve permissions when extracting
-v: verbose (more messages about what is happening, or more detail)
-z: the entire archive is gzip compressed (or uncompressed if extracting)
-j: the entire archive is bzip2 compressed (or uncompressed if extracting)

The -f archive pathname option is almost always used, unless you happen to own a tape drive! Always use -f and an archive file name. The archive file name must immediately follow the -f option with no other option letters in between, for example: tar -tvf my.tar

The -v “verbose” option above lists all the file names as they are put into an archive file, or as they are extracted. This is useful for debugging, but isn’t usually used for a production system where you know exactly what is going into the archive; leave it out for normal use.
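The three operation letters can be tried together in a scratch directory. This is a minimal sketch: the directory layout and the two tiny .c files are invented for the example.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/src" "$workdir/dest"
echo 'int main(void) { return 0; }' >"$workdir/src/main.c"
echo 'void helper(void) {}'         >"$workdir/src/helper.c"

# c = create; the archive name directly follows f.
( cd "$workdir/src" && tar -cf ../my.tar *.c )

# t = table of contents; without v you get one pathname per line.
listing=$(tar -tf "$workdir/my.tar")
echo "$listing"

# x = extract into the current directory; v names each file as it is extracted.
( cd "$workdir/dest" && tar -xvf ../my.tar )
extracted=$(ls "$workdir/dest")
rm -r "$workdir"
```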

If an uncompressed tarball file is damaged, the damage may affect only some of the files in the tarball and the other files, even files stored after the damage point, may still be recoverable.

3.1 Compressed tarballs: tarball.tar.gz and tarball.tar.bz2

A compressed tarball is simply a single tarball file that has been compressed with either gzip or bzip2. The compression compresses the entire tarball, not the individual files inside the tarball.

A tarball file may be first created and then compressed as a whole using either the gzip or bzip2 file compression commands:

$ tar -cf tarball.tar *.c             # create archive named tarball.tar
$ gzip tarball.tar                    # compress into tarball.tar.gz

$ tar -cf tarball.tar *.c             # create archive named tarball.tar
$ bzip2 tarball.tar                   # compress into tarball.tar.bz2

Modern versions of tar have an option letter that does this compression for you (less typing). A compressed tar archive can be created and compressed in one step by an option to the tar command itself:

$ tar -czf tarball.tar.gz *.c         # create and gzip compress into tarball.tar.gz

$ tar -cjf tarball.tar.bz2 *.c        # create and bzip2 compress into tarball.tar.bz2

You generate a table of contents, or extract all the files, using the appropriate de-compression option depending on if and how the tarball file was compressed:

$ tar -tf tarball.tar                 # table of contents if uncompressed
$ tar -tzf tarball.tar.gz             # table of contents if gzip compressed
$ tar -tjf tarball.tar.bz2            # table of contents if bzip2 compressed

$ tar -xf tarball.tar                 # extract contents (uncompressed)
$ tar -xzf tarball.tar.gz             # extract contents (gzip compressed)
$ tar -xjf tarball.tar.bz2            # extract contents (bzip2 compressed)

The tar command doesn’t care what you name your archive file. The gzip compressed tarballs usually have names ending with *.tar.gz or *.tgz and bzip2 compressed tarballs usually have names ending with *.tar.bz2 or *.tb2.

Modern versions of the tar command automatically recognize existing compressed archives and thus don’t require the extra z or j option letters to read compressed archives. You still need the appropriate letter to create a new compressed archive file.
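The two ways of making a gzip compressed tarball can be compared side by side. In this sketch the file names a.c, b.c, two-step.tar, and one-step.tar.gz are invented. The z letter is given explicitly when listing, so the example does not depend on automatic detection.

```shell
workdir=$(mktemp -d)
(
  cd "$workdir" || exit 1
  echo 'alpha' >a.c
  echo 'beta'  >b.c

  # Two steps: make the tarball, then compress it as a whole.
  tar -cf two-step.tar a.c b.c
  gzip two-step.tar                   # leaves two-step.tar.gz

  # One step: let tar do the gzip compression itself.
  tar -czf one-step.tar.gz a.c b.c
)

# Both archives list the same pathnames.
list_two=$(tar -tzf "$workdir/two-step.tar.gz")
list_one=$(tar -tzf "$workdir/one-step.tar.gz")
echo "$list_one"
rm -r "$workdir"
```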

If a compressed tarball file is damaged, all the files following the damage point cannot be decompressed and are usually unrecoverable.

3.2 Using tar to archive or restore a directory

The tar command will automatically recursively archive entire directories into a tarball if you give it directories. Software is often distributed as a tarball file.

$ cd                                  # go to my home directory
$ tar czf /tmp/homedir.tar.gz .       # archive current directory into a file

Do not place the output tarball file in any of the directories being used as input to tar!

When you have a tarball, you can then extract it into the current directory:

$ mkdir /some/backupdir
$ cd /some/backupdir
$ tar xzpf /tmp/homedir.tar.gz        # extract the whole archive into current directory

The p option preserves the modes (permissions) of the files as they are extracted.
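An archive-and-restore cycle can be rehearsed on a small made-up directory tree before you trust it with real data. The sketch below invents the directories home and backupdir under a scratch directory. It checks the result with diff -r, which compares two directory trees recursively (file contents only, not permissions).

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/home/docs" "$workdir/backupdir"
echo 'notes'  >"$workdir/home/docs/notes.txt"
echo 'secret' >"$workdir/home/private.txt"
chmod 600 "$workdir/home/private.txt"

# Archive the directory; the tarball is written outside the directory being archived.
( cd "$workdir/home" && tar czf "$workdir/homedir.tar.gz" . )

# Restore into a different directory, preserving permissions with p.
( cd "$workdir/backupdir" && tar xzpf "$workdir/homedir.tar.gz" )

# No output and exit status 0 from diff -r means the file contents match.
if diff -r "$workdir/home" "$workdir/backupdir"; then
    restore=matches
else
    restore=differs
fi
echo "restore: $restore"
rm -r "$workdir"
```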

3.2.1 Legacy: Using tar to copy a directory

This legacy use of tar to copy an entire directory has been replaced by cp -a or the rsync command.

You can do a directory copy with tar using a pipe instead of an output file. Give tar the special archive file name "-", which stands for standard output (when creating) or standard input (when extracting):

$ cd
$ tar cf - . | ( cd /some/backupdir && tar xpf - )            # local copy
$ tar cf - . | ( ssh otherhost 'cd /some/dir && tar xpf -' )  # remote host copy

The above uses of tar to copy a directory have been largely supplanted by the -a (archive) option to cp or by the rsync command.
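If you want to try the pipe technique anyway, a local copy can be rehearsed like this. It is a sketch only: the src and copy directories and their files are invented for the example.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/src/sub" "$workdir/copy"
echo 'top'    >"$workdir/src/top.txt"
echo 'nested' >"$workdir/src/sub/nested.txt"

# The first tar writes the archive to standard output (f -);
# the second tar, running in the destination, reads it from standard input.
( cd "$workdir/src" && tar cf - . ) | ( cd "$workdir/copy" && tar xpf - )

# The nested file arrived with its directory structure intact.
copied=$(cat "$workdir/copy/sub/nested.txt")
echo "copied file contains: $copied"
rm -r "$workdir"
```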

4 ZIP archives: zip and unzip

A ZIP file is a single file containing individually compressed files. (This is not the same format as a compressed tarball, which is a single compressed file containing individual uncompressed files.)

Unix/Linux can also manipulate ZIP format file archives (often used on Microsoft systems) using zip and unzip:

$ touch file1 file2 file3
$ zip foo file1 file2 file3           # create foo.zip with three files
adding: file1 (stored 0%)
adding: file2 (stored 0%)
adding: file3 (stored 0%)
$ ls -l foo.zip
-rw-rw-r-- 1 idallen idallen 436 Mar  9 03:44 foo.zip
$ unzip -l foo.zip                    # list the contents (do not extract)
Archive:  foo.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2016-03-09 03:44   file1
        0  2016-03-09 03:44   file2
        0  2016-03-09 03:44   file3
---------                     -------
        0                     3 files
$ rm file?
$ unzip foo.zip                       # extract all the files
Archive:  foo.zip
 extracting: file1                   
 extracting: file2                   
 extracting: file3                   

Other options can preserve directory hierarchy and do other things. See the man page.

If a ZIP file is damaged, the damage usually affects only some of the files in the ZIP file and the other files, even files stored after the damage point, may still be recoverable.

5 Differences between ZIP and TAR

6 Differences between text files: diff

The diff command compares two files: diff file1 file2
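A small worked example may help. The two three-line files below are invented for the sketch and differ only in their second line.

```shell
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' >"$workdir/file1"
printf 'alpha\nBETA\ngamma\n' >"$workdir/file2"

# Lines starting with < come from the first file, lines with > from the second.
# Exit status: 0 = files identical, 1 = files differ, 2 = trouble.
if changes=$(diff "$workdir/file1" "$workdir/file2"); then
    diff_status=0
else
    diff_status=$?
fi
echo "$changes"
echo "exit status: $diff_status"
rm -r "$workdir"
```

Here diff prints 2c2 (line 2 of the first file was changed into line 2 of the second), followed by the old line marked with < and the new line marked with >, and the exit status is 1.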

7 Handling Unix/Linux archives and compressed files under Microsoft Windows

Student Tammy Rediger (17F) tells me that “the program 7zip does work with .gz, .bzip2 and .tar files” under Microsoft Windows.

Author: 
| Ian! D. Allen, BA, MMath  -  idallen@idallen.ca  -  Ottawa, Ontario, Canada
| Home Page: http://idallen.com/   Contact Improv: http://contactimprov.ca/
| College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/
| Defend digital freedom:  http://eff.org/  and have fun:  http://fools.ca/
