Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Linux Magazine Column 46 (Mar 2003)
In last month's column, I introduced the File::Find module, included
as part of the core Perl distribution. The File::Find module provides
a framework to recursively notice or manipulate directories and their
contents.
The core work of most uses of File::Find is the wanted subroutine.
For every entry found below one or more starting directories, the
wanted subroutine gets called, letting it do the work. Except for
communication provided via the $File::Find::prune variable, the
subroutine's return value is ignored. This is called a callback model.
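As a reminder of what the callback model looks like in practice, here is a minimal sketch using only core modules. The temporary tree and the directory name skipme are hypothetical, chosen just so the example runs anywhere.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# Build a small throwaway tree so the sketch is self-contained.
# The directory and file names are hypothetical.
my $top = tempdir(CLEANUP => 1);
make_path("$top/src", "$top/skipme");
for my $path ("$top/src/a.txt", "$top/skipme/b.txt") {
    open my $fh, '>', $path or die "Cannot create $path: $!";
    close $fh;
}

my @seen;
sub wanted {
    # $_ is the basename; $File::Find::name is the path from the start.
    if (-d $_ && $_ eq 'skipme') {
        $File::Find::prune = 1;   # tell find not to descend here
        return;                   # the return value itself is ignored
    }
    push @seen, $File::Find::name if -f $_;
}

find(\&wanted, $top);
print "$_\n" for sort @seen;
```

The only way the callback talks back to find is by setting $File::Find::prune; everything else it wants to keep has to go into a variable such as @seen.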
Recently, Richard Clamp was inspired to write a wrapper around
File::Find, called File::Find::Rule, that turns the actions of
descending into a directory into more of a filter model. A rule
object gets created, and a series of methods called against it to set
up ever-narrowing filters separating those items of interest from the
rest.
For example, to create a rule to find only files (and not directories or other things) that were last accessed more than 14 days ago, we create the filter like this:
use File::Find::Rule;
my $filter = File::Find::Rule->new;
Initially, this filter finds everything. We must restrict it:
$filter->file;          # find only files
$filter->atime('>14');  # accessed more than 14 days ago
Now we have $filter, which rejects any entry that doesn't meet both
of these conditions. (The default connector is and, if you want to
think of it as a boolean expression.) All that remains is to give it
a starting point:
my @results = $filter->in("/tmp");
The in method takes one or more starting points and constructs the
appropriate wanted subroutine for a call to File::Find, gathering up
all of the entries that meet the conditions, and returns the result.
The filter still remains, however, and can be reused:
push @results, $filter->in("/usr/tmp");
although we could have done this all at once with:
my @results = $filter->in("/tmp", "/usr/tmp");
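To make concrete what in has to do on our behalf, here is a hand-written sketch of the same intent using plain File::Find. It is not the code the module generates; files_older_than is a hypothetical helper name, and the demo tree is created only so the sketch runs on its own. It uses Perl's -A operator, which measures age in days; how the module's own atime test interprets its argument is worth checking against the documentation for the version you have installed.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hand-written stand-in for a "plain files, last accessed more than
# $days days ago" filter; files_older_than is a hypothetical name.
sub files_older_than {
    my ($days, @dirs) = @_;
    my @results;
    find(sub {
        return unless -f $_;               # plain files only
        return unless (-A _) > $days;      # reuse the stat data from -f
        push @results, $File::Find::name;  # path including the start dir
    }, @dirs);
    return @results;
}

# Throwaway demo tree: one fresh file, one backdated by 30 days.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(fresh.txt stale.txt)) {
    open my $fh, '>', "$dir/$name" or die "Cannot create $dir/$name: $!";
    close $fh;
}
my $then = time - 30 * 24 * 60 * 60;
utime $then, $then, "$dir/stale.txt" or die "Cannot backdate: $!";

my @old = files_older_than(14, $dir);
print "$_\n" for @old;
```

Each narrowing step is an early return inside the callback, which is the and-style chaining of the filter written out by hand.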
So far, this looks a bit messier than simply writing an appropriate wanted routine, but we can simplify it knowing about two shortcuts:
- Nearly all filter routines can be called as class methods (rather than instance methods), which will automatically instantiate a new filter and then add the rule.
- Nearly all filter routines return the instance itself.
The effect of these conventions is that we can ``chain'' most of the filter rules. For example:
use File::Find::Rule;
my $filter = File::Find::Rule->file->atime('>14');
my @results = $filter->in(qw(/tmp /usr/tmp));
Or even more simply:
use File::Find::Rule;
my @results = File::Find::Rule
  ->file
  ->atime('>14')
  ->in(qw(/tmp /usr/tmp));
(Oddly enough, while writing this column, I found a bug that prevents this code from working for version 0.08 of the module. Hopefully by the time you read this, the author will have repaired the bug and uploaded it to the CPAN.)
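The two conventions that make chaining possible can be sketched with a toy class. TinyRule and its methods are hypothetical and exist only for illustration; this is not how File::Find::Rule is implemented, just one simple way to get the same calling style.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# TinyRule is a hypothetical toy class, not part of File::Find::Rule.
package TinyRule;

sub new { my $class = shift; return bless { tests => [] }, $class }

# Convention 1: a method called on the class name quietly creates a
# fresh object, so TinyRule->file can start a chain.
sub _self {
    my $invocant = shift;
    return ref($invocant) ? $invocant : $invocant->new;
}

sub file {
    my $self = _self(shift);
    push @{ $self->{tests} }, sub { -f $_[0] };
    return $self;   # Convention 2: return the object so calls can chain
}

sub name_like {
    my ($invocant, $regex) = @_;
    my $self = _self($invocant);
    push @{ $self->{tests} }, sub { $_[0] =~ $regex };
    return $self;
}

# Accept a path only if every accumulated test passes (and-style).
sub accepts {
    my ($self, $path) = @_;
    for my $test (@{ $self->{tests} }) {
        return 0 unless $test->($path);
    }
    return 1;
}

package main;

my ($fh, $path) = tempfile(SUFFIX => '.pl', UNLINK => 1);
my $rule = TinyRule->file->name_like(qr/\.pl$/);
print $rule->accepts($path) ? "accepted\n" : "rejected\n";
```

Because every filter method hands back the object it was working on, the chain reads left to right as a list of conditions.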
What we have as a result is a generator/filter wrapper around the
callback scheme of File::Find. Because of the added layer, calling
File::Find directly is always going to be a bit faster, but maybe
harder to understand and maintain, so take your pick.
As an example of the simplification, let's go through each of the
tasks presented before, and see how they would look in
File::Find::Rule notation rather than a hand-crafted File::Find
wanted routine. Then we'll look at a couple of the already
available ``plug-ins'' for File::Find::Rule to simplify some common
tasks.
For starters, we have the very common ``print everything below a given directory'', which translates as:
use File::Find::Rule;
print "$_\n" for File::Find::Rule->in('.');
Without any restrictions, we get every name.
The next task reports on the disk blocks used by each user below a given starting point. For this, we'll generate the entire file listing and then iterate over that:
use File::Find::Rule;
my %blocks;
for (File::Find::Rule->file->in('.')) {
  next unless my @stat = stat;      # skip entries that can't be stat'ed
  $blocks{$stat[4]} += $stat[12];   # uid => running total of blocks
}
for (sort {$blocks{$b} <=> $blocks{$a}} keys %blocks) {
  printf "%16s %8d\n", scalar getpwuid($_), $blocks{$_};
}
Here, the only restriction is that it be a file (and not something like a directory or a symbolic link). The remainder of the code is a typical ``take this filename and summarize some information about it'' loop.
Many of the remaining examples ignored the contents of the CVS
directories, using the $File::Find::prune flag of the wanted
callback. For these entries, we either noticed it was a CVS
directory, and returned immediately, or we continued to see if it was
a file, and if so, processed the file.
In the File::Find::Rule logic, the rules are normally chained
together with and-style association. We can get or-style
association using the any method. If we have three filters:
my $filter1 = File::Find::Rule->method1;
my $filter2 = File::Find::Rule->method2a->method2b;
my $filter3 = File::Find::Rule->method3;
Then we can add an alternative of any of these choices to an existing filter chain as:
$filter->any($filter1, $filter2, $filter3);
This is recursive, allowing us to construct arbitrarily complex rules. In practice, I haven't seen more than one or two levels of nesting required.
So, to get all files that aren't within CVS directories, we can use:
use File::Find::Rule;
my $prune_if_cvs = File::Find::Rule
  ->directory
  ->name("CVS")
  ->prune->discard;
my $file = File::Find::Rule->file;
my @files = File::Find::Rule
  ->any($prune_if_cvs, $file)
  ->in("/cvs/bigproject1");
The first filter ($prune_if_cvs) identifies a directory named CVS.
If that's true, the prune special filter sets the prune flag, but
continues to accept the entry. The discard special filter always
fails, causing this entry to be rejected.
The second filter ($file) simply accepts the entry only if it is a
file.
With any, the first alternative is evaluated first. If the prune
and discard get executed, the result is false. (In fact, it's
impossible for that branch to return true, since it ends in
discard.) However, for all entries, we'll then evaluate whether
it's a file, and if so, the filename is accepted.
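For comparison, the same two alternatives can be written out by hand in a File::Find callback. This is only a sketch of the evaluation order described above, using a hypothetical throwaway tree rather than a real CVS checkout.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# Hypothetical throwaway tree: one real file, one file inside CVS.
my $top = tempdir(CLEANUP => 1);
make_path("$top/lib/CVS");
for my $path ("$top/lib/Foo.pm", "$top/lib/CVS/Entries") {
    open my $fh, '>', $path or die "Cannot create $path: $!";
    close $fh;
}

my @files;
find(sub {
    # First alternative: a directory named CVS is pruned and then
    # discarded, so this branch never accepts anything by itself.
    if (-d $_ && $_ eq 'CVS') {
        $File::Find::prune = 1;
    }
    # Second alternative: accept the entry if it is a plain file.
    # A CVS directory fails this test, so it is not collected.
    push @files, $File::Find::name if -f $_;
}, $top);

print "$_\n" for @files;
```

The prune flag does its work as a side effect; the entry inside CVS is never visited, so it never reaches the file test at all.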
Putting this into a larger context, let's identify the total size used by the various MIME types within the CVS tree:
use File::MMagic;
my $mm = File::MMagic->new;
my %total;
use File::Find::Rule;
my $prune_if_cvs = File::Find::Rule
  ->directory
  ->name("CVS")
  ->prune->discard;
my $file = File::Find::Rule->file;
for (File::Find::Rule->any($prune_if_cvs, $file)
     ->in("/cvs/bigproject1")) {
  my $type = $mm->checktype_filename($_);
  $total{$type}{count}++;
  $total{$type}{size} += (stat($_))[12];
  ## push @{$total{$type}{names}}, $_;
}
for (sort keys %total) {
  print "$_ has $total{$_}{count} items with $total{$_}{size} blocks\n";
  ## print map "  $_\n", sort @{$total{$_}{names}};
}
Note the use of the any logic as before. The resulting files then
become the list for the first foreach loop. Within the loop, the
MIME type is identified, and used as a key to distinguish the
summarizing items. (As in the previous version of this program, the
commented lines can be uncommented to include a complete list of all
files that share a given MIME type.)
So far, all of these examples have created an entire list before
starting to process the values. But just like the Unix find
command has an -exec switch, so too does a File::Find::Rule filter
have an exec method.
The exec filter executes a subroutine reference (coderef), passing
the basename, directory name, and the full path name as the first
three parameters. For convenience, the basename of the file is also
present in $_. If the subroutine returns a true value, then the
name is still considered ``accepted'', and the filters continue.
However, we'll typically use this as the final stage in a filter, so
the return value doesn't matter.
The advantage in using an exec filter is most evident when we're
iterating over a large portion of the disk (like a search from the
root directory, for example). Using the return value of in, we'll
wait until the entire filter chain has executed over all of the
directories. Using an exec filter means we get the names one at a
time, as we find them, leading to more immediate results and a
slightly more efficient use of resources (since we don't have to
generate the humongous list).
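The streaming idea can be sketched with plain File::Find, whose callback sees the same three views of each entry: $_ for the basename, $File::Find::dir for the directory, and $File::Find::name for the full path. The temporary files below are hypothetical and exist only to make the sketch runnable.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical throwaway directory with two 10-byte files.
my $top = tempdir(CLEANUP => 1);
for my $name (qw(one.txt two.txt)) {
    open my $fh, '>', "$top/$name" or die "Cannot create $top/$name: $!";
    print {$fh} "x" x 10;
    close $fh;
}

my ($count, $bytes) = (0, 0);
find(sub {
    return unless -f $_;
    # Update running totals as each entry is visited; no list of
    # names is ever accumulated.
    $count++;
    $bytes += -s _;   # reuse the stat data gathered by -f
}, $top);

print "$count files, $bytes bytes\n";
```

Only the two counters survive the traversal, which is the resource saving the paragraph above describes.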
Rewriting that last example using exec, we get:
use File::MMagic;
my $mm = File::MMagic->new;
my %total;
use File::Find::Rule;
my $prune_if_cvs = File::Find::Rule
  ->directory
  ->name("CVS")
  ->prune->discard;
my $file = File::Find::Rule->file;
File::Find::Rule
  ->any($prune_if_cvs, $file)
  ->exec( sub {
    my $type = $mm->checktype_filename($_);
    $total{$type}{count}++;
    $total{$type}{size} += (stat($_))[12];
  } )
  ->in("/cvs/bigproject1");
for (sort keys %total) {
  print "$_ has $total{$_}{count} items with $total{$_}{size} blocks\n";
}
Note that most of the code remained the same. We just put the first
foreach loop body inside the exec filter.
The design of File::Find::Rule is ``pluggable'', meaning that the
author has provided a place to add additional filter rules should we
find something lacking in the core. At the moment, the CPAN already
includes four such plugins.
The ImageSize plugin adds a filter to use the Image::Size module
to select or reject images based on their size (delete all thumbnails
from a directory, for example). The MP3Info plugin uses MP3::Info
to allow selection of MP3 audio files based on things like artist and
album name. The Digest plugin uses the Digest module to compute
various digest functions, like MD5 or SHA1, helping to locate or
reject files based on their digest values.
Let's look at the fourth plugin in a bit more detail, to give you an
idea about the other three. The MMagic plugin gives us a magic
filter, which accepts only those files whose MIME type matches one or
more glob patterns. For example, a filter that finds only text files
can be constructed with:
use File::Find::Rule::MMagic;
my $text_files = File::Find::Rule->magic('text/*');
Note that we had to include the plugin module specifically. The
plugin automatically loads File::Find::Rule, and extends the filter
list to include magic.
So, looking at the last example from last time, let's print out all of the plain text files in the CVS tree:
use File::Find::Rule::MMagic;
my $prune_if_cvs = File::Find::Rule
  ->directory
  ->name("CVS")
  ->prune->discard;
my $file = File::Find::Rule->file;
@ARGV = sort File::Find::Rule
  ->any($prune_if_cvs, $file)
  ->magic('text/plain')
  ->in("/cvs/bigproject1");
while (<>) {
  print "$ARGV\t$_";
}
Note that we're loading @ARGV, so that the final while loop can
iterate over the newly-found file list, opening each file and
displaying the contents.
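The @ARGV idiom works independently of how the file list was produced. Here is a small sketch with two hypothetical temporary files standing in for the results of a search; the lines are collected in an array before printing only so that they are easy to inspect.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Two hypothetical files standing in for a list of found names.
my $dir = tempdir(CLEANUP => 1);
my %content = ('a.txt' => "alpha\n", 'b.txt' => "beta\ngamma\n");
for my $name (keys %content) {
    open my $fh, '>', "$dir/$name" or die "Cannot create $dir/$name: $!";
    print {$fh} $content{$name};
    close $fh;
}

# Loading @ARGV makes the diamond operator open each file in turn;
# $ARGV holds the name of the file currently being read.
@ARGV = sort map { "$dir/$_" } keys %content;
my @lines;
while (my $line = <>) {
    push @lines, "$ARGV\t$line";
}
print @lines;
```

Each output line is prefixed with the name of the file it came from, just as in the listing above.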
I think Richard Clamp is on to something here. I like the ease with
which a filter object can be constructed, and I understand that he's
also working on a command-line version of the interface as well.
That'd really be full circle, to have gone from Unix find to Perl's
File::Find, then to a rule-based syntax, and then back to the
command line. Amazing. Well, until next time, enjoy your new file
finding skills!