Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 43 (Sep 2002)
[suggested title: ``Finding old things'']
One of the great things about the relatively large Perl community is the many ways you can find information about Perl, whether to solve typical tasks or to get past those sticky little frustrating points. Besides the professional documentation that others and I have written, there's also the myriad of manpages that come with the Perl distribution itself, and the CPAN modules.
And with a little searching on the net, we also quickly find very
active support for Perl in the way of Usenet newsgroups, mailing
lists, and a few web communities. The most active Perl web community
that I participate in is known as the Perl Monastery, at
http://perlmonks.org
. Each day, a few hundred active users and a
few thousand visitors post and answer questions, and chat in the
``chatterbox''. Activity at the Monastery earns you ``experience points'',
which give you access to more features and acknowledge your
growing responsibility and track record.
In my past roles as a system administrator, I always seemed to be under pressure to solve problems in annoyingly brief amounts of time. The Monastery is a welcome resource, because questions often get answered within minutes, so help is just a browser-reload or two away.
Recently, a relatively new Monk (as we call the participants) who goes by the name ``Aquilo'' asked for help with a script that:
Recurses through a directory structure and checks if more than half of the files in each directory have been used in the past 180 days. The path of each directory that is predominantly unused is appended to a list of directories that will be used to archive them.
Aquilo gave some sample code that performed directory recursion in the
traditional beginner way. I say traditional, because it appears that
one of Perl's ``rites of passage'' seems to be to ``write a directory
recursion routine''. Generally, these solutions are not as flexible or
efficient or portable as Perl's built-in File::Find
module, but the
beginners generally aren't aware of this module.
A typical hand-rolled directory recursion starts like so:
    sub do_this {
        my ($dir) = shift;
        if (opendir DIR, $dir) { # it's a directory
            for my $entry (sort readdir DIR) {
                do_this("$dir/$entry"); # recurse
            }
        } else { # it's a file
            # operate on $dir (not shown)
        }
    }
    do_this("fred"); # our top-level directory
And when that gets typed in, and it runs forever, the beginner will
start to scratch their head wondering why. Well, the basic structure
is correct, except that the readdir
call in the fourth line returns
not only the files contained within the directory, but also the
directories contained within the directory. That's good for the most
part, but we also get the always-present ``dot'' and ``dotdot'' entries
pointing at the current directory and the parent directory.
This means that while we are processing our first directory (say
fred
), we'll also recurse to process that directory followed by
slash-dot (fred/.
), which processes the same directory again, which
then recurses to process that new name followed by slash-dot
(fred/./.
), forever. Oops! Similarly, we'll also process the
parent directory, which then contains this directory as an entry.
Even the fastest supercomputer cannot process an infinite loop in less
than infinite time.
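It's easy to see those troublemaking entries for yourself. Here's a minimal sketch (using a throwaway directory created with the core File::Temp module) showing that readdir hands back ``dot'' and ``dotdot'' right alongside the real contents:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# make a scratch directory holding one ordinary file
my $dir = tempdir(CLEANUP => 1);
open my $fh, ">", "$dir/dino" or die "create: $!";
close $fh;

# readdir returns *everything* in the directory, including . and ..
opendir my $dh, $dir or die "opendir: $!";
my @entries = sort readdir $dh;
closedir $dh;

print "@entries\n";   # prints: . .. dino
```

A naive recursion that doesn't skip those first two entries will chase its own tail forever.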
So, the next move the beginner usually takes is to strip out the dot files, or maybe just dot and dot-dot. Something like this:
    sub do_this {
        my ($dir) = shift;
        if (opendir DIR, $dir) { # it's a directory
            for my $entry (sort readdir DIR) {
                next if $entry eq "." or $entry eq "..";
                do_this("$dir/$entry"); # recurse
            }
        } else { # it's a file
            # operate on $dir (not shown)
        }
    }
    do_this("fred"); # our top-level directory
And this is a bit better. For the most part, everything runs fine, but we still run into trouble when we hit a symbolic link (or ``symlink''). If the symlink points to a directory in a parallel part of the tree, that's usually OK. But if it points to a directory deeper in the current tree, we'll process that portion twice: once as a result of following the symlink, and once when we actually get there by normal directory recursion.
And then there's the possibility of a symlink pointing to a directory above our current directory. It's just as bad as following dot-dot: we'll process the new directory recursively, coming right back down to where we've already gone. Infinite loop time again.
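Before we get to the usual fix, note that skipping symlinks isn't the only way out. One common alternative (a sketch of my own here, not part of Aquilo's script) is to remember the device and inode numbers of everything we've visited, and refuse to descend into anything we've already seen. That breaks symlink loops no matter where the links point:

```perl
use strict;
use warnings;

my %seen;    # keys are "device,inode" strings of visited entries

sub do_this {
    my $dir = shift;
    # stat follows symlinks, so a symlink to a visited directory
    # resolves to dev/ino numbers we've already recorded
    my ($dev, $ino) = stat $dir or return;
    return if $seen{"$dev,$ino"}++;       # already been here: loop averted
    if (opendir my $dh, $dir) {           # it's a directory
        for my $entry (sort readdir $dh) {
            next if $entry eq "." or $entry eq "..";
            do_this("$dir/$entry");       # recurse
        }
        closedir $dh;
    } else {                              # it's a file
        # operate on $dir (not shown)
    }
}
```

The cost is a hash entry per visited item, which is usually cheap compared to the disk traffic of walking the tree.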
So, the next necessary refinement is often ``ignore symbolic links'':
    sub do_this {
        my ($dir) = shift;
        if (opendir DIR, $dir) { # it's a directory
            for my $entry (sort readdir DIR) {
                next if $entry eq "." or $entry eq "..";
                next if -l "$dir/$entry";
                do_this("$dir/$entry"); # recurse
            }
        } else { # it's a file
            # operate on $dir (not shown)
        }
    }
    do_this("fred"); # our top-level directory
And there you have a perfectly fine directory recursion routine. As
long as you're running on Unix, and not MacOS, VMS, OS/2, or Windows,
because the step that creates $dir/$entry
is wrong for those
others. And there are speed optimizations we haven't even begun to
consider here, such as avoiding recursion into directories that have
no subdirectories. Luckily, that's all done for us in the
File::Find
module, so let's get back to solving the issue raised by
Aquilo.
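If you do want to fix just the portability wart in a hand-rolled walker, the core File::Spec module builds paths using the conventions of whatever system the script is running on, instead of a hard-coded slash. A quick sketch:

```perl
use strict;
use warnings;
use File::Spec;

# catfile joins directory and file names with the local separator:
# a slash on Unix, but the right thing on MacOS, VMS, OS/2, or Windows
my $path = File::Spec->catfile("fred", "dino");
print "$path\n";    # on Unix, prints: fred/dino
```

File::Find uses this kind of logic internally, which is one reason to prefer it over rolling our own.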
First, we'll pull in the module:
use File::Find;
Next, let's set up a hash: actually, a nested hash.
my %ages;
We'll use this to keep track of the number of items in the directory
that are both newer and older than 180 days old. We'll make each key
in %ages
be the full directory path name. The corresponding value
will be a hashref pointing to a hash with two keys in it: old
and
new
. The values of those two entries will be a running count of
the number of files of that category in that directory.
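To make the shape of that data structure concrete, here's what %ages might look like after a pass over a tree (the directory names and counts here are made up for illustration):

```perl
use strict;
use warnings;

# each key is a directory path; each value is a hashref holding
# running counts of old and new files in that directory
my %ages = (
    "fred/a" => { old => 12, new => 3 },
    "fred/b" => { old => 1,  new => 40 },
);

print "$ages{'fred/a'}{old}\n";    # prints: 12
```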
Now the easy and fun part: calling File::Find::find
:
find \&wanted, "fred";
The find
routine expects a reference to a subroutine (``coderef'') as
its first parameter. It then descends down into all of the
directories listed as the remaining arguments: in this case, just
fred
. The subroutine gets called for every entry below the listed
directories, with a few parameters set. Let's take a look at
wanted
now:
    sub wanted {
        return unless -f $_;
        my $old_flag = -M $_ > 180 ? 'old' : 'new';
        $ages{$File::Find::dir}{$old_flag}++;
    }
There's a lot going on in these few lines. First, the subroutine
returns quickly if we're not looking at a file. The value of $_
is
the basename of the current entry (like $entry
in the hand-rolled
version earlier), and our current directory is the directory being
examined. Next, $old_flag
is set to either old
or new
, based
on looking at the file modification time. Finally, the hash
referenced by the value of $ages{$File::Find::dir}
has one of its
two entries incremented, based on this $old_flag
. The value of
$File::Find::dir
is the full pathname to our current directory.
So, if a file in fred/a
named dino
was newer than 180 days, then
$ages{"fred/a"}{new}
would be incremented, showing that we have a new
file.
After the directory recursion pass is complete, we merely have to walk the data to see what's old enough now:
    for my $dir (sort keys %ages) {
        if ($ages{$dir}{old} > $ages{$dir}{new}) {
            print "$dir has more old than new\n";
        }
    }
And there you have it! For all directories that have more old entries than new entries, we'll report it.
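Here's the whole approach assembled into one runnable sketch, exercised against a throwaway tree (built with File::Temp, which isn't part of the solution itself). I use utime to backdate a couple of files past the 180-day cutoff so there's something to report:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# build a scratch tree: two "old" files and one "new" file in $top/a
my $top = tempdir(CLEANUP => 1);
mkdir "$top/a" or die "mkdir: $!";
for my $name ("$top/a/old1", "$top/a/old2", "$top/a/new1") {
    open my $fh, ">", $name or die "create: $!";
    close $fh;
}
my $stale = time - 200 * 24 * 60 * 60;           # 200 days ago
utime $stale, $stale, "$top/a/old1", "$top/a/old2" or die "utime: $!";

# count old versus new files per directory, as in the column
my %ages;
find(sub {
    return unless -f $_;
    my $old_flag = -M $_ > 180 ? 'old' : 'new';
    $ages{$File::Find::dir}{$old_flag}++;
}, $top);

# report the predominantly-old directories
for my $dir (sort keys %ages) {
    if (($ages{$dir}{old} || 0) > ($ages{$dir}{new} || 0)) {
        print "$dir has more old than new\n";
    }
}
```

Running it should report the backdated directory, since two of its three files are older than the cutoff.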
On that last comparison, we'll get warnings if -w
or use
warnings
is enabled, because some directories have new items but not
old, or old items but not new. For a small program like this, I
probably wouldn't bother enabling warnings, but if you're a stickler
for that, then you might also want to clean it up a bit before
comparing:
    for my $dir (sort keys %ages) {
        my $old = $ages{$dir}{old} || 0;
        my $new = $ages{$dir}{new} || 0;
        if ($old > $new and $old > 50) {
            print "$dir has more old than new\n";
        }
    }
And notice, while I was cleaning up, I couldn't resist tinkering a bit, making it so that directories with 50 or fewer old items aren't reported. Just one more thing Perl makes slightly easier.
And there you have it. A typical system administration task hacked out in a few lines of code, and a reference to a great resource. Until next time, enjoy!