Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Unix Review Column 16 (September 1997)

Perl excels at those little annoying tasks -- ones that don't generally have an ``off-the-shelf'' solution or tool, but come up from time to time as you are maintaining your own data, or performing some sort of system administration for others.

For example, let's take a typical task: removing unneeded empty directories. Imagine you have a web or FTP server to distribute various software packages, and different things get moved into and out of the distribution area (say, /archive) from time to time. The tools to ``publish'' a particular package know enough to create subdirectories as needed, but for various reasons, the tools that ``unpublish'' a package don't remove anything but the top-level directory of that package.

What you'll end up with over time is a tree of partially full and sometimes empty directories. You've decided it's time to clean this tree up, perhaps as a nightly cron-job. You want to delete all the empty directories, but obviously leave all the directories that are actually doing something.

Let's take a stab at it with a recursive subroutine. We'll have to use a directory handle to read the contents of each directory, and when we discover a subdirectory, recurse into it and then try to blow it away if it's finally empty.

    &smash("/archive");
    sub smash {
      my $dir = shift;
      opendir DIR, $dir or return;
      my @contents =
        map "$dir/$_",
        sort grep !/^\.\.?$/,
        readdir DIR;
      closedir DIR;
      foreach (@contents) {
        next unless !-l && -d;
        &smash($_);
        rmdir $_;
      }
    }

Initially, the &smash subroutine is invoked with a full path to the top-level directory. This becomes the $dir variable inside the subroutine, which is then opened as a directory. (Unreadable directories are silently skipped, because the subroutine simply returns when the opendir fails.)

The contents of the directory are then preened and processed. Reading that four-line expression from back to front, we read the directory, throw out ``.'' and ``..'', sort it alphabetically, and then put the directory name in front of each name.
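
If you'd like to see that pipeline in isolation, here's a small self-contained version of it, run against a scratch directory built with the standard File::Temp module (the names ``alpha'' and ``beta'' are invented for the demonstration, and I'm using a lexical directory handle rather than the bareword DIR, which works the same way):

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);    # scratch directory for the demo
for my $name (qw(beta alpha)) {
  mkdir "$dir/$name" or die "mkdir: $!";
}

opendir my $dh, $dir or die "opendir: $!";
my @contents =
  map "$dir/$_",                    # put the directory name in front
  sort                              # sort alphabetically
  grep !/^\.\.?$/,                  # throw out "." and ".."
  readdir $dh;
closedir $dh;

print "$_\n" for @contents;
```

Even though ``beta'' was created first, the sort step puts ``alpha'' ahead of it in @contents, each with the full $dir prefix attached.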

After that, the foreach loop walks $_ through the list of names, identifying the eligible subdirectory candidates (not a symlink, but is a directory). For each subdirectory, we call &smash, and then try to remove what is hopefully now an empty directory. If the rmdir() fails, no big deal: it was a directory with still-active contents, and is serving a real purpose.
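
That fail-safe property of rmdir() is easy to see on its own. Here's a tiny self-contained demonstration (not part of the cleanup script), again using File::Temp for a throwaway tree:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $top = tempdir(CLEANUP => 1);
mkdir "$top/empty" or die "mkdir: $!";
mkdir "$top/full"  or die "mkdir: $!";
open my $fh, ">", "$top/full/file" or die "open: $!";
close $fh;

# rmdir returns true for the empty directory...
my $ok_empty = rmdir "$top/empty";
# ...and returns false (setting $!) for the one with contents
my $ok_full  = rmdir "$top/full";

print $ok_empty ? "removed empty\n" : "kept empty\n";
print $ok_full  ? "removed full\n"  : "kept full\n";
```

The or-die checks on mkdir are just there to keep the demonstration honest; in the cleanup script, quietly ignoring the rmdir failure is exactly the behavior we want.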

While this code works fine, recursing through subdirectories is such a common operation that it'd be a shame to rewrite code that looks like this over and over again. For one thing, it's easy to make a small mistake (like not checking for symbolic links when looking for subdirectories). Fortunately, a standard library, called File::Find, provides all the right stuff to do the job.

Here's the same code rewritten using the standard library:

    use File::Find;
    finddepth (\&wanted, "/archive");
    sub wanted {
      rmdir $_;
    }

Here, File::Find exports a subroutine called finddepth. This subroutine expects a reference to a subroutine (here \&wanted) as its first parameter, followed by a list of top-level directories. finddepth recurses through the filesystem beginning at the indicated directories, calling &wanted for each name that's found, with some variables set up to make the job easier. For example, $_ is set to the basename of the file (no directory part, but that's OK, because finddepth automatically does a chdir() to the parent directory). So, all we have to do is try to rmdir() it, which will fail if it's not a directory or not empty!
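
To convince yourself that this one-line wanted routine really does the whole job, here's a self-contained run against a scratch tree (the directory names are made up for the demonstration):

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

my $top = tempdir(CLEANUP => 1);
mkdir "$top/empty"           or die "mkdir: $!";
mkdir "$top/full"            or die "mkdir: $!";
mkdir "$top/full/also_empty" or die "mkdir: $!";
open my $fh, ">", "$top/full/keep.txt" or die "open: $!";
close $fh;

# remove every empty directory, deepest first
finddepth(sub { rmdir $_ }, $top);

print -d "$top/empty" ? "empty survived\n" : "empty removed\n";
print -d "$top/full"  ? "full survived\n"  : "full removed\n";
```

After the finddepth call, both empty and full/also_empty are gone, while full (still holding keep.txt) survives, because its rmdir quietly failed.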

Even that's a little too much typing. We can use an anonymous subroutine in place of creating a named subroutine:

    use File::Find;
    finddepth (sub {
                 rmdir $_;
               }, "/archive");

Here, the sub { } construct is enough to create a subroutine without a name, and pass its coderef in as the first parameter.

You could even get to all of this from the Unix command line:

    perl -MFile::Find \
        -e 'finddepth(sub {rmdir $_}, @ARGV)' \
        /archive

Here, the -M is like the use directive earlier, and we're taking the names from @ARGV, making this a nice, easily aliased command.

If that seems like too many things to remember, you can also use the slick find2perl tool to generate code that looks like this. This tool takes an extended find command syntax, and generates a Perl script:

    find2perl /archive -depth -eval 'rmdir $_' |
        perl

Notice here I'm feeding the script directly to Perl. (See the manpage for find2perl for more details.)

But let's go back to more intense uses of File::Find from within programs and show how this makes cooler things possible. For example, I'm sometimes curious about where all my disk space is going, so it'd be nice to get a report of the top 20 biggest files. Here's a little script to do that:

    use File::Find;
    find (sub {
            $size{$File::Find::name} =
              -s if -f;
          }, @ARGV);
    @sorted = sort {
      $size{$b} <=> $size{$a}
    } keys %size;
    splice @sorted, 20 if @sorted > 20;
    foreach (@sorted) {
      printf "%10d %s\n", $size{$_}, $_;
    }

Here, I'm using File::Find again; besides finddepth, it also defines a find routine with similar parameters. (The difference between the two is minor, having to do with whether we get notified about a particular directory before or after its contents.) Inside the anonymous subroutine, besides getting $_ set to the basename of the file, we also get $File::Find::name set to the full name of the file, which is handy here.
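
That ordering difference is easy to demonstrate. With a scratch directory containing one subdirectory, find reports the parent before its contents, while finddepth reports the contents first (which is what made the rmdir trick work):

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

my $top = tempdir(CLEANUP => 1);
mkdir "$top/sub" or die "mkdir: $!";

my (@pre, @post);
find(sub { push @pre, $File::Find::name }, $top);       # parent first
finddepth(sub { push @post, $File::Find::name }, $top); # contents first

print "find:      @pre\n";
print "finddepth: @post\n";
```

The two lists contain the same names, just in opposite orders.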

For each $_ that is a file (tested via -f), we save its size (via -s) into a hash keyed by the full pathname ($File::Find::name). When the data is finally acquired, we then get a list of filenames sorted by their sizes, using a sort block on the keys. Then, we toss everything after the first 20 entries and print out the result.
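
The sort-by-value and splice combination is worth seeing on its own. Here's the same idiom applied to a hand-built hash (the paths and sizes are invented), keeping just the top two entries instead of twenty:

```perl
use strict;
use warnings;

# stand-in for the %size hash that find() would have built
my %size = (
  "/a/big"    => 500,
  "/a/medium" => 200,
  "/a/small"  => 10,
);

# keys ordered from largest value to smallest
my @sorted = sort { $size{$b} <=> $size{$a} } keys %size;
splice @sorted, 2 if @sorted > 2;   # keep only the top two

printf "%10d %s\n", $size{$_}, $_ for @sorted;
```

The splice with a single offset and no length throws away everything from that position onward, so only /a/big and /a/medium are printed.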

Small, elegant, simple. File::Find lets us attack those recursive problems much more easily than the corresponding shell scripts would.

Let's try one slightly more complicated example. Suppose we have the sources to the zark tool (a fictional utility) unpacked into /usr/local/src/zark, and we want to build this tool for both the supercycle and ultracycle machines, in separate build directories, but sharing the source tree as much as possible. One way to do that is to create a symlink tree which mirrors the source tree's directory structure and contains symlinks to all the corresponding files.

In particular, doing this by hand would look something like:

    ## Bourne shell: set variables
    source=/usr/local/src/zark
    build=/usr/local/build/supercycle/zark
    mkdir $build
    ln -s $source/Makefile $build/Makefile
    ln -s $source/zark.c $build/zark.c
    ## and so on...
    ## make subdirectories:
    mkdir $build/zifflelib
    ln -s $source/zifflelib/Makefile $build/zifflelib/Makefile
    ln -s $source/zifflelib/zif.c $build/zifflelib/zif.c
    ## and so on..
    mkdir $build/zufflib
    ln -s $source/zufflib/Makefile $build/zufflib/Makefile
    ln -s $source/zufflib/zuf.c $build/zufflib/zuf.c
    ## and so on..

As you can see, this'd be pretty tedious. So, let's do it the simple way... with Perl:

    use File::Find;
    $src = shift; # first arg is source
    $dst = shift; # second arg is dest
    find(sub {
           (my $rel_name = $File::Find::name)
             =~ s!.*/\./!!s;
           my $src_name = "$src/$rel_name";
           my $dst_name = "$dst/$rel_name";
           if (-d) {
             print "mkdir $dst_name\n";
             mkdir $dst_name, 0777
               or warn "mkdir $dst_name: $!";
           } else {
             print "ln -s $src_name $dst_name\n";
             symlink $src_name, $dst_name
               or warn "symlink $src_name $dst_name: $!";
           }
         }, "$src/./");

The trickiest part of this find subroutine is determining the common pathname parts of both the source and destinations, which I'm doing by including a fake "/./" in the middle of the source path. This doesn't affect the effective path but is easy enough to scan for with the regular expression (shown here in computing $rel_name).
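
You can watch that substitution work on a single made-up path:

```perl
use strict;
use warnings;

# a name as find() would report it, with the "/./" marker embedded
my $name = "/usr/local/src/zark/./zifflelib/zif.c";
(my $rel_name = $name) =~ s!.*/\./!!s;
print "$rel_name\n";    # prints "zifflelib/zif.c"
```

The greedy .* eats everything up through the last occurrence of ``/./'', leaving only the part of the path relative to the source tree.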

I hope you enjoyed this little excursion into the hierarchical filesystem, and that the examples provided will save you some time in the future. After all, that's what Perl is all about.


Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.