Winter 2019 - January to April 2019 - Updated 2019-01-06 04:31 EST
find
This is optional material for CST8207.
The Problem:
The find command is showing me pathnames. I could use the mouse to copy-and-paste these pathnames into many cp commands, but surely there must be a way to automate this? Can the cp command select file names the same way that find can?
The idea of Unix/Linux is that every command does one thing well, so
they don’t put features of find into cp. You use find to generate
the names and you use cp to copy the names. The trick is getting the
names generated by find to be used by cp.
For an introductory assignment, I don’t expect more knowledge than copy and paste using your mouse, but that’s not how a real sysadmin would do it. Here are some optional hints on how a real sysadmin would get the pathnames copied without using a mouse or copy-and-paste.
find -exec
The designers of the find command built in a mechanism to run a command
using the pathnames that find finds. It’s the -exec option. Go read
man find and look at how -exec works. The man page for find has
one example in the EXAMPLES section of the man page (along with lots
of other uses of find) and you can actually use this example to run
file on a whole bunch of files:
find . -type f -exec file '{}' \;
You can append the above -exec and following arguments to any
already-working find command you have, replacing the . starting point
and -type f expression in the example with your own starting point and
expression to find the pathnames you want. The find command line with
the above added -exec expression will then run file on each of the
pathnames found by find, one at a time.
The find command will run the -exec command once per pathname.
The pathname generated by find is inserted into the -exec command line
where that quoted set of braces is. You might be able to see it better
if you insert an echo in front of the command line being run by find,
to echo on your screen the command that is being built and executed:
find . -type f -exec echo file '{}' \;
(Make sure you get this simple -exec echo file example working on your
own set of pathnames before you try to modify it to do something more
complicated such as a file copy.)
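As a minimal sketch of the echo technique above, the following builds a small scratch directory and lets find print the command lines it would run. The variable names (demo, dry_run) and the file names (one.txt, two words.txt) are invented for illustration and are not part of any assignment.

```shell
# Scratch directory with hypothetical file names, used only for this demo.
demo=$(mktemp -d)
touch "$demo/one.txt" "$demo/two words.txt"

# Dry run: echo prints each command line instead of running file.
dry_run=$(find "$demo" -type f -exec echo file '{}' \;)
printf '%s\n' "$dry_run"
```

Each output line is one command that find would have run, one per pathname. Note that echo does not show quoting, so a pathname containing a blank looks like two words on screen even though find passes it to the command as a single argument.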
But of course you don’t want to simply run file on each pathname;
you want to copy each pathname into a single destination directory.
I’ll leave most of this as an “exercise for the student”, with the
following hint:
find will supply the source
pathname argument to cp, in place of the quoted braces; what is missing in the above line that
uses file is the destination directory needed by cp. You will
have to add the destination directory name in the right place and
also change the command name file to be the command name cp
in the above line. Leave the echo ahead of the command line you
are building until you see find generate on your screen the cp
command lines that you know will work, then take out the echo
and let find run the multiple cp commands for you.
The above is just one way to automate the copy by having find do the
work for you. It has the disadvantage that it runs a separate cp
command for every pathname find finds, which is no problem if there
are only three pathnames but is a huge problem if there are a million
pathnames because find will have to run cp a million times (and that
takes time).
Modern versions of find have a modified -exec statement ending in +
instead of ; that can pack multiple file names into the same command
execution, reducing the number of times the command has to be executed
by increasing the number of pathnames passed to each execution:
find . -type f -exec file '{}' +
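The difference between the two endings can be counted in a scratch directory. This is only a sketch: the directory and the three file names are invented, and echo stands in for the real command so that each execution prints exactly one line.

```shell
# Scratch directory with three hypothetical files.
demo=$(mktemp -d)
touch "$demo/a" "$demo/b" "$demo/c"

# Traditional form: one echo per pathname, so three output lines.
per_file=$(find "$demo" -type f -exec echo '{}' \; | wc -l | tr -d ' ')

# Plus form: the pathnames are packed together, so one output line here.
packed=$(find "$demo" -type f -exec echo '{}' + | wc -l | tr -d ' ')

printf 'one-per-pathname runs: %s, packed runs: %s\n' "$per_file" "$packed"
```

With only three short pathnames everything fits in a single execution; with a very large list, find would still need more than one execution, but far fewer than one per pathname.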
This works similarly to xargs, which is described next:
xargs
If you have a million files to copy, using find with the traditional
version of -exec is not the way to do it, since you will have to call
and run the cp command program once per pathname, and that means running
cp a million times. Even if cp did nothing, it would take a long
time to re-execute cp a million times. We can do this more efficiently.
The cp command is designed to allow multiple source pathnames if they
are all being copied into the same destination directory. We could
reduce the number of cp commands run if we could put multiple source
pathnames into each cp command line. If we could fit a million source
pathnames on one cp command line, we would only need one single cp
command to do the work. This is a huge savings compared to running cp
a million times.
Alas, most Unix systems have a limit on the total length of a command
line. You can’t fit a million pathnames on one single cp command line.
This is why the xargs program was written.
The xargs program reads a (usually large) list of pathnames from
standard input. It will read those pathnames and pack a command line with
as many of those pathnames as can possibly fit, then call the command,
then repeat with another large number of pathnames, and repeat again
until all the pathnames are processed. By packing each command line
as full of pathnames as it possibly can, it uses the minimum number of
commands needed to get the job done.
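The packing behaviour can be seen on a small scale with the standard -n option to xargs, which caps the number of arguments added to each command line. This is only a sketch: the cap of 3 and the single-letter names are arbitrary, and without -n xargs would fit far more names on each line.

```shell
# Seven names on standard input, at most three per generated command line.
# echo shows the command lines that xargs builds instead of running file.
batches=$(printf '%s\n' a b c d e f g | xargs -n 3 echo file)
printf '%s\n' "$batches"
# Prints:
#   file a b c
#   file d e f
#   file g
```

Seven names needed only three command executions here; the same batching, with much larger batches, is what lets xargs handle a million pathnames in a small number of commands.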
See man xargs and look at the EXAMPLES section for examples using
find to generate pathnames that get sent into xargs. Sysadmins always
use the -print0 option to find and the -0 option to xargs so
that blanks in pathnames don’t cause problems. (See the man pages.)
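The effect of a blank in a pathname can be checked by counting the arguments that xargs hands to a command. The sketch below assumes the temporary directory path itself contains no blanks; the file name is invented, and -n 1 is used only so that each argument lands on its own output line.

```shell
demo=$(mktemp -d)
touch "$demo/two words.txt"   # one file whose name contains a blank

# Plain output: xargs splits the name at the blank into two arguments.
split=$(find "$demo" -type f | xargs -n 1 echo | wc -l | tr -d ' ')

# NUL-terminated output: the name arrives as one argument.
whole=$(find "$demo" -type f -print0 | xargs -0 -n 1 echo | wc -l | tr -d ' ')

printf 'arguments without -print0: %s, with -print0: %s\n' "$split" "$whole"
```

One file turns into two bogus arguments in the first pipeline, which is exactly the kind of failure that -print0 and -0 prevent.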
Since xargs can only add lists of pathnames to the end of a command
line (where most commands expect them), this poses a problem for a file
copy that expects all the source filenames to precede the destination
directory name. The maintainers of GNU cp added the -t option to
cp so that you could specify the destination directory first on the
command line, allowing all the source pathnames to be stacked at the end
just the way xargs generates them:
$ cp -t /tmp file1 file2 file3 # file4 file5 etc...
You need to use the -t option when you use cp inside xargs so that
the list of source pathnames can appear at the end of the command line.
Again, insert echo at the start of your xargs command lines (and
start with only a few pathnames on standard input, not hundreds) until
you see echoing on your screen the command lines you know will work.
Then take out the echo and feed the full list of pathnames.
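Putting the pieces of this section together, one possible sketch of the echo-first workflow looks like this. The src and dest directories and the file names are hypothetical scratch locations, and the example assumes GNU cp, since -t is not available in every version of cp.

```shell
src=$(mktemp -d)    # hypothetical source tree
dest=$(mktemp -d)   # hypothetical destination, kept outside the source tree
touch "$src/one.txt" "$src/two words.txt"

# Dry run first: show the cp command line that xargs would build.
find "$src" -type f -print0 | xargs -0 echo cp -t "$dest"

# The same pipeline without echo performs the copy (cp -t is a GNU option).
find "$src" -type f -print0 | xargs -0 cp -t "$dest"
ls "$dest"
```

Keeping the destination outside the tree that find searches avoids having find pick up the freshly copied files as new source pathnames.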
As described in the previous section, modern versions of find have a modified -exec statement ending in + instead of ; that can pack multiple file names into the same command execution, reducing the number of times the command has to be executed by increasing the number of pathnames passed to each execution.
$(command)
The shells have a command substitution feature that lets you
take the standard output of any command and insert it into a
command line. (See the heading Command Substitution in
man bash, and also previous class notes such as
CST8207 Command Substitution
or
CST8129 Command Substitution.)
You might think of using this handy feature to take the standard output
of find (a list of pathnames) and insert it into a cp command line.
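This command substitution might work, but it has serious limitations: the shell splits pathnames that contain blanks and expands any shell meta-characters in them, and a very large list of pathnames may not fit on one command line.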
In other words, command substitution only works sometimes, where the
other two solutions presented earlier work every time (provided you use
-print0 in your find command!).
Since sysadmins want solutions that always work and won’t mysteriously start failing in the future, avoid using command substitution to naïvely generate pathnames needed by other commands if those pathnames might ever contain blanks or other shell meta-characters, or if the list of pathnames might be very large. The embedded blanks and shell meta-characters in the pathnames, or the sheer number of pathnames, will some day cause errors if you rely on command substitution.
(With correct use of shell options to turn off file GLOBbing and suppress the splitting of words on blanks, you can almost safely write a shell script that does use command substitution and pathnames, but it isn’t pretty, doesn’t work for file names with newlines in them, and the options used are unsuitable for interactive shell use. It can still stop working if the list of pathnames is longer than is allowed on a command line. Don’t do it!)
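The word-splitting failure described in this section is easy to reproduce in a scratch directory. In this sketch the directory and file name are invented, the temporary path is assumed to contain no blanks or GLOB characters, and set -- is used only as a convenient way to count the arguments a command would receive.

```shell
demo=$(mktemp -d)
touch "$demo/two words.txt"   # a single file with a blank in its name

# Unquoted command substitution: the shell splits find's output on blanks.
set -- $(find "$demo" -type f)
words=$#

printf 'files: 1, arguments seen by the command: %s\n' "$words"
```

A command such as cp given these arguments would look for two files that do not exist, which is why the find -exec and xargs approaches are the safer choice.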