Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 16 (September 1997)
Perl excels at those little annoying tasks -- ones that don't generally have an ``off-the-shelf'' solution or tool, but come up from time to time as you are maintaining your own data, or performing some sort of system administration for others.
For example, let's take a typical task: removing unneeded empty
directories. Imagine you have a web or FTP server to distribute
various software packages, and different things get moved into and out
of the distribution area (say, /archive) from time to time. The
tools to ``publish'' a particular package know enough to create
subdirectories as needed, but for various reasons, the tools that
``unpublish'' a package don't remove anything but the top-level
directory of that package.
What you'll end up with over time is a tree of partially full and sometimes empty directories. You've decided it's time to clean this tree up, perhaps as a nightly cron-job. You want to delete all the empty directories, but obviously leave all the directories that are actually doing something.
Let's take a stab at it with a recursive subroutine. We'll have to use a directory handle to read the contents of each directory, and when we discover a subdirectory, recurse into it and then try to blow it away if it's finally empty.
&smash("/archive");

sub smash {
    my $dir = shift;
    opendir DIR, $dir or return;
    my @contents =
        map "$dir/$_",
        sort
        grep !/^\.\.?$/,
        readdir DIR;
    closedir DIR;
    foreach (@contents) {
        next unless !-l && -d;
        &smash($_);
        rmdir $_;
    }
}
Initially, the &smash subroutine is invoked with a full path to the top-level directory. This becomes the $dir variable inside the subroutine, which is then opened as a directory. (Bad directories are automatically skipped when the opendir fails.)
The contents of the directory are then preened and processed. Reading that four-line expression from back to front, we read the directory, throw out ``.'' and ``..'', sort it alphabetically, and then put the directory name in front of each name.
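To see that back-to-front pipeline in isolation, here's a sketch run on a hypothetical readdir result (the directory name and entries in @raw are made up for illustration):

```perl
my $dir = "/archive/pkg";                 # hypothetical directory
my @raw = (".", "..", "zebra", "apple");  # what readdir might hand back

my @contents =
    map "$dir/$_",        # put the directory name in front
    sort                  # alphabetize
    grep !/^\.\.?$/,      # throw out "." and ".."
    @raw;

print "$_\n" for @contents;
# /archive/pkg/apple
# /archive/pkg/zebra
```

Note that the grep runs first, the sort second, and the map last, even though they read in the opposite order on the page.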
After that, the foreach loop walks $_ through the list of names,
identifying the eligible subdirectory candidates (not a symlink, but
is a directory). For each subdirectory, we call &smash, and then try
to remove what is hopefully now an empty directory. If the rmdir()
fails, no big deal: it was a directory with still-active contents, and
is serving a real purpose.
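That ``no big deal'' depends on rmdir refusing to remove a non-empty directory, which is exactly the behavior we want. Here's a minimal demonstration, using a scratch directory from File::Temp (the demo setup is mine, not part of the original code):

```perl
use File::Temp qw(tempdir);

my $top = tempdir(CLEANUP => 1);          # scratch area for the demo
mkdir "$top/empty" or die "mkdir: $!";
mkdir "$top/full"  or die "mkdir: $!";
open my $fh, ">", "$top/full/keep.txt" or die "open: $!";
close $fh;

rmdir "$top/empty";   # succeeds: nothing inside
rmdir "$top/full";    # fails harmlessly: still has contents

print -d "$top/empty" ? "empty kept\n" : "empty removed\n";
print -d "$top/full"  ? "full kept\n"  : "full removed\n";
# empty removed
# full kept
```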
While this code works fine, recursing through subdirectories is such a
common operation that it'd be a shame to rewrite code like this
over and over again. For one thing, it's easy to make a small
mistake (like not checking for symbolic links when looking for
subdirectories). Fortunately, a standard library, File::Find,
provides all the right stuff to do the job.
Here's the same code rewritten using the standard library:
use File::Find;

finddepth (\&wanted, "/archive");

sub wanted {
    rmdir $_;
}
Here, File::Find exports a subroutine called finddepth.
This subroutine expects a reference to a subroutine
(here \&wanted) as its first parameter, followed by a list of top-level
directories. finddepth recurses through the
filesystem beginning at the indicated directories, calling &wanted for
each name that's found, with some variables set up to make it easier.
For example, $_ is set to the basename of the file (no directory part,
but that's OK because finddepth
automatically does a chdir()
to the
parent directory). So, all we have to do is try to rmdir()
it, which
will fail if it's not a directory or not empty!
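To make that division of labor concrete, this little sketch (the scratch directory and names are mine, for illustration) records what finddepth hands to the callback:

```perl
use File::Find;
use File::Temp qw(tempdir);

my $top = tempdir(CLEANUP => 1);
mkdir "$top/sub" or die "mkdir: $!";

finddepth(sub {
    # $_ is just the basename; $File::Find::name is the full path
    print "basename=$_ full=$File::Find::name\n";
}, $top);
```

For the subdirectory, $_ is simply sub while $File::Find::name carries the whole path, so a bare rmdir $_ works: we've already been chdir()ed into the parent.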
Even that's a little too much typing. We can use an anonymous subroutine in place of creating a named subroutine:
use File::Find;

finddepth (sub { rmdir $_; }, "/archive");
Here, the sub { }
construct is enough to create a subroutine without
a name, and pass its coderef in as the first parameter.
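The sub { } expression simply evaluates to an ordinary code reference, the same kind of value \&wanted produced, so it could also be held in a variable first:

```perl
use File::Find;

my $zap = sub { rmdir $_ };   # anonymous subroutine in a scalar
print ref $zap, "\n";         # prints "CODE"

# Either spelling works as the first argument:
#   finddepth($zap, "/archive");
#   finddepth(sub { rmdir $_ }, "/archive");
```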
You could even get to all of this from the Unix commandline:
perl -MFile::Find \
  -e 'finddepth(sub {rmdir $_}, @ARGV)' \
  /archive
Here, the -M
is like the use
directive earlier, and we're taking
the names from @ARGV, making this a nice easily aliasable command.
If that seems like too many things to remember, you can also use the slick find2perl tool to generate code that looks like this. This tool takes an extended find command syntax, and generates a Perl script:
find2perl /archive -depth -eval 'rmdir $_' | perl
Notice here I'm feeding the script directly to Perl. (See the manpage for find2perl for more details.)
But let's go back to more intense uses of File::Find
from within
programs and show how this makes cooler things possible. For example,
I'm sometimes curious about where all my disk space is going, so it'd
be nice to get a report of the top 20 biggest files. Here's a little
script to do that:
use File::Find;

find (sub {
    $size{$File::Find::name} = -s if -f;
}, @ARGV);

@sorted = sort { $size{$b} <=> $size{$a} } keys %size;
splice @sorted, 20 if @sorted > 20;
foreach (@sorted) {
    printf "%10d %s\n", $size{$_}, $_;
}
Here, I'm using File::Find again, which defines a find
routine as well as finddepth, with similar parameters. (The
difference between the two routines is minor, having to do with
whether we get notified about a particular directory before or after
its contents.) Inside the anonymous subroutine, besides getting $_
set to the basename of the file, we also get $File::Find::name set to
the full name of the file, which is handy here.
For each $_ that is a file (tested via -f
), we save its size (via
-s
) into a hash keyed by the full pathname ($File::Find::name).
When the data is finally acquired, we then get a list of filenames
sorted by their sizes, using a sort block on the keys. Then, we toss
everything after the first 20 entries and print out the result.
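Here's the same sort-and-splice idea on a tiny made-up %size hash, so the mechanics are visible without walking a real filesystem (paths and sizes are invented):

```perl
my %size = (
    "/a/big.iso"  => 700_000_000,   # made-up sizes
    "/a/mid.tar"  =>  50_000_000,
    "/a/tiny.txt" =>         120,
);

my @sorted = sort { $size{$b} <=> $size{$a} } keys %size;
splice @sorted, 2 if @sorted > 2;   # keep only the top two here

printf "%10d %s\n", $size{$_}, $_ foreach @sorted;
```

This prints big.iso first, then mid.tar; tiny.txt gets spliced away.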
Small, elegant, simple. File::Find lets us attack these recursive
problems much more easily than the corresponding shell scripts would.
Let's try one slightly more complicated example. Suppose we have the
sources to the zark tool (a fictional utility) unpacked into
/usr/local/src/zark
, and we want to build this tool for both the
supercycle and ultracycle machines, in separate build
directories, but sharing the source tree as much as possible. One way
to do that is to create a symlink tree which mirrors the source
tree's directory structure and contains symlinks to all the
corresponding files.
In particular, doing this by hand would look something like:
## Bourne shell: set variables
source=/usr/local/src/zark
build=/usr/local/build/supercycle/zark
mkdir $build
ln -s $source/Makefile $build/Makefile
ln -s $source/zark.c $build/zark.c
## and so on...
## make subdirectories:
mkdir $build/zifflelib
ln -s $source/zifflelib/Makefile $build/zifflelib/Makefile
ln -s $source/zifflelib/zif.c $build/zifflelib/zif.c
## and so on..
mkdir $build/zufflib
ln -s $source/zufflib/Makefile $build/zufflib/Makefile
ln -s $source/zufflib/zuf.c $build/zufflib/zuf.c
## and so on..
As you can see, this'd be pretty tedious. So, let's do it the simple way... with Perl:
use File::Find;

$src = shift;   # first arg is source
$dst = shift;   # second arg is dest

find(sub {
    (my $rel_name = $File::Find::name) =~ s!.*/\./!!s;
    my $src_name = "$src/$rel_name";
    my $dst_name = "$dst/$rel_name";
    if (-d) {
        print "mkdir $dst_name\n";
        mkdir $dst_name, 0777
            or warn "mkdir $dst_name: $!";
    } else {
        print "ln -s $src_name $dst_name\n";
        symlink $src_name, $dst_name
            or warn "symlink $src_name $dst_name: $!";
    }
}, "$src/./");
The trickiest part of this find subroutine is determining the
common pathname parts of both the source and destinations, which I'm
doing by including a fake "/./"
in the middle of the source path.
This doesn't affect the effective path but is easy enough to scan
for with the regular expression (shown here in computing $rel_name).
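You can watch that substitution work on a hypothetical full name:

```perl
my $name = "/usr/local/src/zark/./zifflelib/zif.c";
(my $rel_name = $name) =~ s!.*/\./!!s;
print "$rel_name\n";   # zifflelib/zif.c
```

Everything up through the ``/./'' marker is stripped, leaving exactly the part of the path that's relative to the source tree.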
I hope you enjoyed this little excursion into the hierarchical filesystem, and that the examples provided will save you some time in the future. After all, that's what Perl is all about.