Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Linux Magazine Column 46 (Mar 2003)

In last month's column, I introduced the File::Find module, included as part of the core Perl distribution. The File::Find module provides a framework to recursively notice or manipulate directories and their contents.

The core work of most uses of File::Find is the wanted subroutine. For every entry below one or more starting directories, the wanted subroutine gets called, letting it do the work. Except for communication provided via the $File::Find::prune variable, the subroutine's return value is ignored. This is called a callback model.
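As a reminder of what the callback model looks like, here is a minimal sketch using only core modules. The temporary tree and the directory names keep and skip are made up for illustration; they are not from last month's column.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a small throwaway tree so the sketch is self-contained.
my $top = tempdir(CLEANUP => 1);
for my $dir (qw(keep skip)) {
  mkdir "$top/$dir" or die "mkdir $dir: $!";
}
for my $file (qw(keep/a.txt skip/b.txt)) {
  open my $fh, '>', "$top/$file" or die "open $file: $!";
  close $fh;
}

my @seen;
my $wanted = sub {
  # $_ holds the basename; $File::Find::name holds the full path.
  if (-d $_ and $_ eq 'skip') {
    $File::Find::prune = 1;   # tell File::Find not to descend here
    return;
  }
  push @seen, $File::Find::name if -f $_;
  return 'ignored';           # File::Find never looks at this value
};
find($wanted, $top);
print "$_\n" for @seen;
```

Only the file under keep is reported: the callback does all the selecting, and the only thing it communicates back to File::Find is the prune flag.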

Recently, Richard Clamp was inspired to write a wrapper around File::Find, called File::Find::Rule, that turns the actions of descending into a directory into more of a filter model. A rule object gets created, and a series of methods is called on it to set up ever-narrowing filters separating those items of interest from the rest.

For example, to create a rule to find only files (and not directories or other things) that were last accessed more than 14 days ago, we create the filter like this:

  use File::Find::Rule;
  my $filter = File::Find::Rule->new;

Initially, this filter finds everything. We must restrict it:

  $filter->file; # find only files
  my $cutoff = time - 14 * 24 * 60 * 60; # stat tests compare raw epoch seconds
  $filter->atime("<$cutoff"); # last accessed more than 14 days ago

Now we have $filter, which rejects any entry that doesn't meet both of these selections. (The default connector is and, if you want to think of it as a boolean expression.) All that remains is to give it a starting point:

  my @results = $filter->in("/tmp");

The in method takes a starting point and constructs the appropriate wanted subroutine for a call to File::Find, gathering up all of the entries that meet the conditions, and returns the result. The filter still remains, however, and can be reused:

  push @results, $filter->in("/usr/tmp");

although we could have done this all at once with:

  my @results = $filter->in("/tmp", "/usr/tmp");
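For comparison, a hand-written wanted routine for the same kind of search might look like the following sketch. It is an illustration of the equivalent logic, not the code the module actually generates, and the temporary directory and the file names old.log and new.log are invented so that the example can run anywhere.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Create two files and backdate their timestamps by a known number of days.
my $top = tempdir(CLEANUP => 1);
my $now = time;
my %age_in_days = ('old.log' => 30, 'new.log' => 1);
for my $name (keys %age_in_days) {
  my $path = "$top/$name";
  open my $fh, '>', $path or die "open $path: $!";
  close $fh;
  my $stamp = $now - $age_in_days{$name} * 24 * 60 * 60;
  utime $stamp, $stamp, $path or die "utime $path: $!";
}

my $cutoff = $now - 14 * 24 * 60 * 60;
my @results;
find(sub {
  return unless -f $_;        # files only
  my $atime = (stat _)[8];    # reuse the stat already done by -f
  push @results, $File::Find::name if $atime < $cutoff;
}, $top);
print "$_\n" for @results;
```

Only old.log is reported. Both tests live inside one callback here, which is exactly the part that the rule object assembles for us.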

So far, this looks a bit messier than simply writing an appropriate wanted routine, but we can simplify it knowing about two shortcuts:

  1. Nearly all filter routines can be called as class methods (rather than instance methods), which will automatically instantiate a new filter and then add the rule.

  2. Nearly all filter routines return the instance directly.
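To see how these two conventions might be implemented, consider the toy class below. Toy::Rule and its methods (min_length, matching, and the _instance helper) are hypothetical names invented for this sketch; it filters a plain list of strings rather than a directory tree, and it is not how File::Find::Rule itself is written.

```perl
use strict;
use warnings;

package Toy::Rule;   # hypothetical; not part of File::Find::Rule

sub new { my $class = shift; return bless { tests => [] }, $class }

# Convention 1: a class-method call quietly becomes a fresh instance.
sub _instance {
  my $self = shift;
  return ref $self ? $self : $self->new;
}

sub min_length {
  my $self = _instance(shift);
  my $min  = shift;
  push @{ $self->{tests} }, sub { length($_[0]) >= $min };
  return $self;      # Convention 2: hand back the instance for chaining
}

sub matching {
  my $self = _instance(shift);
  my $re   = shift;
  push @{ $self->{tests} }, sub { $_[0] =~ $re };
  return $self;
}

# Keep only the items that pass every test (and-style).
sub in {
  my $self = _instance(shift);
  my @kept;
  ITEM: for my $item (@_) {
    for my $test (@{ $self->{tests} }) {
      next ITEM unless $test->($item);
    }
    push @kept, $item;
  }
  return @kept;
}

package main;

my @hits = Toy::Rule->min_length(4)->matching(qr/^t/)->in(qw(tmp temp test usr));
print "@hits\n";
```

The first call in the chain is made on the class name, and every later call is made on the object that the previous call returned.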

The effect of these conventions is that we can ``chain'' most of the filter rules. For example:

  use File::Find::Rule;
  my $cutoff = time - 14 * 24 * 60 * 60; # stat tests compare raw epoch seconds
  my $filter = File::Find::Rule->file->atime("<$cutoff");
  my @results = $filter->in(qw(/tmp /usr/tmp));

Or even more simply:

  use File::Find::Rule;
  my $cutoff = time - 14 * 24 * 60 * 60; # stat tests compare raw epoch seconds
  my @results = File::Find::Rule
    ->file
    ->atime("<$cutoff")
    ->in(qw(/tmp /usr/tmp));

(Oddly enough, while writing this column, I found a bug that prevents this code from working for version 0.08 of the module. Hopefully by the time you read this, the author will have repaired the bug and uploaded it to the CPAN.)

What we have as a result is a generator/filter wrapper around the callback scheme of File::Find. Because of the added layer, calling File::Find directly is always going to be a bit faster, but maybe harder to understand and maintain, so take your pick.

As an example of the simplification, let's go through each of the tasks presented before, and see how they would look in File::Find::Rule notation rather than a hand-crafted File::Find wanted routine. Then we'll look at a couple of the already available ``plug-ins'' for File::Find::Rule to simplify some common tasks.

For starters, we have the very common ``print everything below a given directory'', which translates as:

  use File::Find::Rule;
  print "$_\n" for File::Find::Rule->in('.');

Without any restrictions, we get every name.

The next task reports on the disk blocks used by each user below a given starting point. For this, we'll generate the entire file listing and then iterate over that:

  use File::Find::Rule;
  my %blocks;
  for (File::Find::Rule->file->in('.')) {
    next unless my @stat = stat;
    $blocks{$stat[4]} += $stat[12];
  }
  for (sort {$blocks{$b} <=> $blocks{$a}} keys %blocks) {
    printf "%16s %8d\n", scalar getpwuid($_), $blocks{$_};
  }

Here, the only restriction is that it be a file (and not something like a directory or a symbolic link). The remainder of the code is a typical ``take this filename and summarize some information about it'' loop.

Many of the remaining examples ignored the contents of the CVS directories, using the $File::Find::prune flag of the wanted callback. For these entries, we either noticed it was a CVS directory, and returned immediately, or we continued to see if it was a file, and if so, processed the file.

In the File::Find::Rule logic, the rules are normally chained together with and-style association. We can get or-style association using the any method. If we have three filters:

  my $filter1 = File::Find::Rule->method1;
  my $filter2 = File::Find::Rule->method2a->method2b;
  my $filter3 = File::Find::Rule->method3;

Then we can add an alternative of any of these choices to an existing filter chain as:

  $filter->any($filter1, $filter2, $filter3);

This is recursive, allowing us to construct arbitrarily complex rules. In practice, I haven't seen more than one or two levels of nesting required.

So, to get all files that aren't within CVS directories, we can use:

  use File::Find::Rule;
  my $prune_if_cvs = File::Find::Rule
    ->directory
    ->name("CVS")
    ->prune->discard;
  my $file = File::Find::Rule->file;
  my @files = File::Find::Rule
    ->any($prune_if_cvs, $file)
    ->in("/cvs/bigproject1");

The first filter ($prune_if_cvs) identifies a directory named CVS. If that's true, the prune special filter sets the prune flag, but continues to accept the entry. The discard special filter always fails, causing this entry to be rejected.

The second filter ($file) simply accepts the entry only if it is a file.

With any, the first condition will be evaluated. If the prune and discard get executed, the result is false. (In fact, it's impossible for that branch to return true, since it ends in discard.) However, for all entries, we'll then evaluate whether it's a file, and if so, the filename is accepted.
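The evaluation order can be modeled with plain coderefs. In the sketch below, all_of, any_of, and the hash-based entries are hypothetical stand-ins invented for illustration; they only mimic the and/or logic described above and do not touch the filesystem or the module.

```perl
use strict;
use warnings;

my @pruned;   # records which entries would have had the prune flag set

# and-style: every test must pass, evaluated left to right.
sub all_of {
  my @tests = @_;
  return sub {
    my $entry = shift;
    for my $test (@tests) { return 0 unless $test->($entry) }
    return 1;
  };
}

# or-style: accept as soon as one alternative accepts.
sub any_of {
  my @rules = @_;
  return sub {
    my $entry = shift;
    for my $rule (@rules) { return 1 if $rule->($entry) }
    return 0;
  };
}

my $is_dir  = sub { $_[0]{type} eq 'dir' };
my $is_file = sub { $_[0]{type} eq 'file' };
my $is_cvs  = sub { $_[0]{name} eq 'CVS' };
my $prune   = sub { push @pruned, $_[0]{name}; 1 };   # side effect, then accept
my $discard = sub { 0 };                              # always reject

my $prune_if_cvs = all_of($is_dir, $is_cvs, $prune, $discard);
my $wanted       = any_of($prune_if_cvs, $is_file);

my @entries = (
  { name => 'CVS',    type => 'dir'  },
  { name => 'src',    type => 'dir'  },
  { name => 'main.c', type => 'file' },
);
my @accepted = map { $_->{name} } grep { $wanted->($_) } @entries;
print "accepted: @accepted; pruned: @pruned\n";
```

The CVS entry triggers the prune side effect but is never accepted, while main.c falls through to the second alternative and is kept.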

Putting this into a larger context, let's identify the total size used by the various MIME types within the CVS tree:

  use File::MMagic;
  my $mm = File::MMagic->new;
  my %total;
  use File::Find::Rule;
  my $prune_if_cvs = File::Find::Rule
    ->directory
    ->name("CVS")
    ->prune->discard;
  my $file = File::Find::Rule->file;
  for (File::Find::Rule->any($prune_if_cvs, $file)
                       ->in("/cvs/bigproject1")) {
    my $type = $mm->checktype_filename($_);
    $total{$type}{count}++;
    $total{$type}{size} += (stat($_))[12];
    ## push @{$total{$type}{names}}, $_;
  }
  for (sort keys %total) {
    print "$_ has $total{$_}{count} items with $total{$_}{size} blocks\n";
    ## print map "  $_\n", sort @{$total{$_}{names}};
  }

Note the use of the any logic as before. The resulting files then become the list for the first foreach loop. Within the loop, the MIME type is identified, and used as a key to distinguish the summarizing items. (As in the previous version of this program, the commented lines can be uncommented to include a complete list of all files that share a given MIME type.)

So far, all of these examples have created an entire list before starting to process the values. But just like the Unix find command has an exec switch, so too does a File::Find::Rule filter.

The exec filter executes a subroutine reference (coderef), passing the basename, directory name, and the full path name as the first three parameters. For convenience, the basename of the file is also present in $_. If the subroutine returns a true value, then the name is still considered ``accepted'', and the filters continue. However, we'll typically use this as the final stage in a filter, so the return value doesn't matter.

The advantage in using an exec filter is most evident when we're iterating over a large portion of the disk (like a search from the root directory, for example). If we rely on the list returned by in, we have to wait until the entire filter chain has executed over all of the directories before we see a single name. Using an exec filter means we get the names one at a time, as we find them, leading to more immediate results and a slightly more efficient use of resources (since we don't have to generate the humongous list).
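The difference between the two styles can be seen with plain File::Find as well. This is a sketch under invented assumptions (a temporary directory holding three 10-byte files), meant only to contrast collecting a list with doing the work inside the callback.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

my $top = tempdir(CLEANUP => 1);
for my $name (qw(a.txt b.txt c.txt)) {
  open my $fh, '>', "$top/$name" or die "open $name: $!";
  print {$fh} "x" x 10;
  close $fh or die "close $name: $!";
}

# Style 1: collect every name first, then process the finished list.
my @names;
find(sub { push @names, $File::Find::name if -f $_ }, $top);
my $bytes_from_list = 0;
$bytes_from_list += -s $_ for @names;

# Style 2: process each entry inside the callback; no list is ever built.
my $bytes_streamed = 0;
find(sub { $bytes_streamed += -s _ if -f $_ }, $top);

print "list: $bytes_from_list, streamed: $bytes_streamed\n";
```

Both styles arrive at the same total. The second never holds more than one name at a time, which is the property the exec filter gives us.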

Rewriting that last example using exec, we get:

  use File::MMagic;
  my $mm = File::MMagic->new;
  my %total;
  use File::Find::Rule;
  my $prune_if_cvs = File::Find::Rule
    ->directory
    ->name("CVS")
    ->prune->discard;
  my $file = File::Find::Rule->file;
  File::Find::Rule
    ->any($prune_if_cvs, $file)
    ->exec( sub {
      my $type = $mm->checktype_filename($_);
      $total{$type}{count}++;
      $total{$type}{size} += (stat($_))[12];
    } )->in("/cvs/bigproject1");
  for (sort keys %total) {
    print "$_ has $total{$_}{count} items with $total{$_}{size} blocks\n";
  }

Note that most of the code remained the same. We just put the first foreach loop body inside the exec filter.

The design of File::Find::Rule is ``pluggable'', meaning that the author has provided a place to add additional filter rules should we find something lacking in the core. At the moment, the CPAN already includes four such plugins.

The ImageSize plugin adds a filter to use the Image::Size module to select or reject images based on their size (delete all thumbnails from a directory, for example). The MP3Info plugin uses MP3::Info to allow selection of MP3 audio files based on things like artist and album name. The Digest plugin uses the Digest module to compute various digest functions, like MD5 or SHA1, helping to locate or reject files based on their digest values.

Let's look at the fourth plugin in a bit more detail, to give you an idea about the other three. The MMagic plugin gives us a magic filter, which accepts only those MIME types that match one or more glob patterns. For example, a filter that finds only text files can be constructed with:

  use File::Find::Rule::MMagic;
  my $text_files = File::Find::Rule->magic('text/*');

Note that we had to include the plugin module specifically. The plugin automatically loads File::Find::Rule, and extends the filter list to include magic.

So, looking at the last example from last time, let's print out all of the plain text files in the CVS tree:

  use File::Find::Rule::MMagic;
  my $prune_if_cvs = File::Find::Rule
    ->directory
    ->name("CVS")
    ->prune->discard;
  my $file = File::Find::Rule->file;
  @ARGV = sort File::Find::Rule
    ->any($prune_if_cvs, $file)
    ->magic('text/plain')
    ->in("/cvs/bigproject1");
  while (<>) {
    print "$ARGV\t$_";
  }

Note that we're loading @ARGV, so that the final while loop can iterate over the newly-found file list, opening each file and displaying the contents.

I think Richard Clamp is on to something here. I like the ease with which a filter object can be constructed, and I understand that he's also working on a command-line version of the interface. That'd really be full circle, to have gone from Unix find to Perl's File::Find, then to a rule-based syntax, and then back to the command line. Amazing. Well, until next time, enjoy your new file finding skills!


Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.