Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 43 (Sep 2002)
[suggested title: ``Finding old things'']
One of the great things about the relatively large Perl community is the many ways you can find information about Perl, whether to solve typical tasks or to get past those sticky little frustrating points. Besides the professional documentation that others and I have written, there's also the myriad of manpages that come with the Perl distribution itself, and the CPAN modules.
And with a little searching on the net, we also quickly find very
active support for Perl in the way of Usenet newsgroups, mailing
lists, and a few web communities. The most active Perl web community
that I participate in is known as the Perl Monastery, at
http://perlmonks.org
. Each day, a few hundred active users and a
few thousand visitors post and answer questions, and chat in the
``chatterbox''. Activity at the Monastery earns you ``experience points'',
which give you access to more features and acknowledge your
growing responsibility and track record.
In my past roles as a system administrator, I always seemed to be under pressure to solve problems in annoyingly brief amounts of time. The Monastery is a welcome resource, because questions often get answered within minutes, so help is just a browser-reload or two away.
Recently, a relatively new Monk (as we call the participants) who goes by the name ``Aquilo'' asked for help with a script that:
Recurses through a directory structure and checks if more than half of the files in each directory have been used in the past 180 days. The path of each directory that is predominantly unused is appended to a list of directories that will be used to archive them.
Aquilo gave some sample code that performed directory recursion in the
traditional beginner way. I say traditional, because it appears that
one of Perl's ``rites of passage'' seems to be to ``write a directory
recursion routine''. Generally, these solutions are not as flexible or
efficient or portable as Perl's built-in File::Find
module, but the
beginners generally aren't aware of this module.
A typical hand-rolled directory recursion starts like so:
    sub do_this {
        my ($dir) = shift;
        if (opendir DIR, $dir) { # it's a directory
            for my $entry (sort readdir DIR) {
                do_this("$dir/$entry"); # recurse
            }
        } else { # it's a file
            # operate on $dir (not shown)
        }
    }
    do_this("fred"); # our top-level directory
And when that gets typed in, and it runs forever, the beginner will
start to scratch their head wondering why. Well, the basic structure
is correct, except that the readdir
call in the fourth line returns
not only the files contained within the directory, but also the
directories contained within the directory. That's good for the most
part, but we also get the always-present ``dot'' and ``dotdot'' entries
pointing at the current directory and the parent directory.
This means that while we are processing our first directory (say
fred
), we'll also recurse to process that directory followed by
slash-dot (fred/.
), which processes the same directory again, which
then recurses to process that new name followed by slash-dot
(fred/./.
), forever. Oops! Similarly, we'll also process the
parent directory, which then contains this directory as an entry.
Even the fastest supercomputer cannot process an infinite loop in less
than infinite time.
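It's easy to see those troublemaking entries for yourself. Here's a minimal sketch (using a throwaway directory created with the core File::Temp module) showing that readdir hands back ``dot'' and ``dotdot'' right alongside the real contents:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# make a scratch directory holding one ordinary file
my $dir = tempdir(CLEANUP => 1);
open my $fh, ">", "$dir/dino" or die "create: $!";
close $fh;

# readdir returns *everything* in the directory, including . and ..
opendir my $dh, $dir or die "opendir: $!";
my @entries = sort readdir $dh;
closedir $dh;

print "@entries\n";   # prints: . .. dino
```

A naive recursion that doesn't skip those first two entries will chase its own tail forever.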
So, the next move the beginner usually takes is to strip out the dot files, or maybe just dot and dot-dot. Something like this:
    sub do_this {
        my ($dir) = shift;
        if (opendir DIR, $dir) { # it's a directory
            for my $entry (sort readdir DIR) {
                next if $entry eq "." or $entry eq "..";
                do_this("$dir/$entry"); # recurse
            }
        } else { # it's a file
            # operate on $dir (not shown)
        }
    }
    do_this("fred"); # our top-level directory
And this is a bit better. For the most part, everything runs fine, but we still run into trouble when we hit a symbolic link (or ``symlink''). If the symlink points to a directory in a parallel part of the tree, that's usually OK. But if it points to a directory deeper in the current tree, we'll process that portion twice: once as a result of following the symlink, and once when we actually get there by normal directory recursion.
And then there's the possibility of a symlink pointing to a directory above our current directory. It's just as bad as following dot-dot: we'll process the new directory recursively, coming right back down to where we've already gone. Infinite loop time again.
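Before we get to the usual fix, note that skipping symlinks isn't the only way out. One common alternative (a sketch of my own here, not part of Aquilo's script) is to remember the device and inode numbers of everything we've visited, and refuse to descend into anything we've already seen. That breaks symlink loops no matter where the links point:

```perl
use strict;
use warnings;

my %seen;    # keys are "device,inode" strings of visited entries

sub do_this {
    my $dir = shift;
    # stat follows symlinks, so a symlink to a visited directory
    # resolves to dev/ino numbers we've already recorded
    my ($dev, $ino) = stat $dir or return;
    return if $seen{"$dev,$ino"}++;       # already been here: loop averted
    if (opendir my $dh, $dir) {           # it's a directory
        for my $entry (sort readdir $dh) {
            next if $entry eq "." or $entry eq "..";
            do_this("$dir/$entry");       # recurse
        }
        closedir $dh;
    } else {                              # it's a file
        # operate on $dir (not shown)
    }
}
```

The cost is a hash entry per visited item, which is usually cheap compared to the disk traffic of walking the tree.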
So, the next necessary refinement is often ``ignore symbolic links'':
    sub do_this {
        my ($dir) = shift;
        if (opendir DIR, $dir) { # it's a directory
            for my $entry (sort readdir DIR) {
                next if $entry eq "." or $entry eq "..";
                next if -l "$dir/$entry";
                do_this("$dir/$entry"); # recurse
            }
        } else { # it's a file
            # operate on $dir (not shown)
        }
    }
    do_this("fred"); # our top-level directory
And there you have a perfectly fine directory recursion routine. As
long as you're running on Unix, and not MacOS, VMS, OS/2, or Windows,
because the step that creates $dir/$entry
is wrong for those
others. And there are speed optimizations we haven't even begun to
consider here, such as avoiding recursion into directories that have
no subdirectories. Luckily, that's all done for us in the
File::Find
module, so let's get back to solving the issue raised by
Aquilo.
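If you do want to fix just the portability wart in a hand-rolled walker, the core File::Spec module builds paths using the conventions of whatever system the script is running on, instead of a hard-coded slash. A quick sketch:

```perl
use strict;
use warnings;
use File::Spec;

# catfile joins directory and file names with the local separator:
# a slash on Unix, but the right thing on MacOS, VMS, OS/2, or Windows
my $path = File::Spec->catfile("fred", "dino");
print "$path\n";    # on Unix, prints: fred/dino
```

File::Find uses this kind of logic internally, which is one reason to prefer it over rolling our own.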
First, we'll pull in the module:
use File::Find;
Next, let's set up a hash: actually, a nested hash.
my %ages;
We'll use this to keep track of the number of items in the directory
that are both newer and older than 180 days old. We'll make each key
in %ages
be the full directory path name. The corresponding value
will be a hashref pointing to a hash with two keys in it: old
and
new
. The values of those two entries will be a running count of
the number of files of that category in that directory.
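To make the shape of that data structure concrete, here's what %ages might look like after a pass over a tree (the directory names and counts here are made up for illustration):

```perl
use strict;
use warnings;

# each key is a directory path; each value is a hashref holding
# running counts of old and new files in that directory
my %ages = (
    "fred/a" => { old => 12, new => 3 },
    "fred/b" => { old => 1,  new => 40 },
);

print "$ages{'fred/a'}{old}\n";    # prints: 12
```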
Now the easy and fun part: calling File::Find::find
:
find \&wanted, "fred";
The find
routine expects a reference to a subroutine (``coderef'') as
its first parameter. It then descends down into all of the
directories listed as the remaining arguments: in this case, just
fred
. The subroutine gets called for every entry below the listed
directories, with a few parameters set. Let's take a look at
wanted
now:
    sub wanted {
        return unless -f $_;
        my $old_flag = -M $_ > 180 ? 'old' : 'new';
        $ages{$File::Find::dir}{$old_flag}++;
    }
There's a lot going on in these few lines. First, the subroutine
returns quickly if we're not looking at a file. The value of $_
is
the basename of the current entry (like $entry
in the hand-rolled
version earlier), and our current directory is the directory being
examined. Next, $old_flag
is set to either old
or new
, based
on looking at the file modification time. Finally, the hash
referenced by the value of $ages{$File::Find::dir}
has one of its
two entries incremented, based on this $old_flag
. The value of
$File::Find::dir
is the full pathname to our current directory.
So, if a file in fred/a
named dino
was newer than 180 days, then
$ages{"fred/a"}{new}
would be incremented, showing that we have a new
file.
After the directory recursion pass is complete, we merely have to walk the data to see what's old enough now:
    for my $dir (sort keys %ages) {
        if ($ages{$dir}{old} > $ages{$dir}{new}) {
            print "$dir has more old than new\n";
        }
    }
And there you have it! For all directories that have more old entries than new entries, we'll report it.
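Here's the whole approach assembled into one runnable sketch, exercised against a throwaway tree (built with File::Temp, which isn't part of the solution itself). I use utime to backdate a couple of files past the 180-day cutoff so there's something to report:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# build a scratch tree: two "old" files and one "new" file in $top/a
my $top = tempdir(CLEANUP => 1);
mkdir "$top/a" or die "mkdir: $!";
for my $name ("$top/a/old1", "$top/a/old2", "$top/a/new1") {
    open my $fh, ">", $name or die "create: $!";
    close $fh;
}
my $stale = time - 200 * 24 * 60 * 60;           # 200 days ago
utime $stale, $stale, "$top/a/old1", "$top/a/old2" or die "utime: $!";

# count old versus new files per directory, as in the column
my %ages;
find(sub {
    return unless -f $_;
    my $old_flag = -M $_ > 180 ? 'old' : 'new';
    $ages{$File::Find::dir}{$old_flag}++;
}, $top);

# report the predominantly-old directories
for my $dir (sort keys %ages) {
    if (($ages{$dir}{old} || 0) > ($ages{$dir}{new} || 0)) {
        print "$dir has more old than new\n";
    }
}
```

Running it should report the backdated directory, since two of its three files are older than the cutoff.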
On that last comparison, we'll get warnings if -w
or use
warnings
is enabled, because some directories have new items but not
old, or old items but not new. For a small program like this, I
probably wouldn't bother enabling warnings, but if you're a stickler
for that, then you might also want to clean it up a bit before
comparing:
    for my $dir (sort keys %ages) {
        my $old = $ages{$dir}{old} || 0;
        my $new = $ages{$dir}{new} || 0;
        if ($old > $new and $old > 50) {
            print "$dir has more old than new\n";
        }
    }
And notice, while I was cleaning up, I couldn't resist tinkering a bit, making it so that directories with 50 or fewer old items aren't reported. Just one more thing Perl makes slightly easier.
And there you have it. A typical system administration task hacked out in a few lines of code, and a reference to a great resource. Until next time, enjoy!