Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Linux Magazine Column 73 (Aug 2005)
[Suggested title: ``File::Finder: finding files easier'']
Sooner or later, every Perl hacker ends up wanting to process a
collection of files contained within a directory, including all the
files in all the subdirectories. Thankfully, Perl comes with the
File::Find module to perform this task in a tested, portable
manner. File::Find's basic interface is simple:
use File::Find;
sub wanted {
    return unless -f;
    print $File::Find::name, "\n";
}
find \&wanted, ".";
Here, File::Find exports the find routine, which takes a coderef
and a list of starting points (here, ``dot'' for the current directory).
The mechanism inside File::Find locates all filesystem entries
below (and including) the list of starting points, and calls the subroutine
referenced by the coderef for each entry. Note that File::Find
doesn't filter anything. It's up to the subroutine to ignore the entries
that are not of interest.
The wanted routine gets the full pathname in $File::Find::name,
and the basename in $_. For efficiency, the current process is
also chdir'ed to the directory being examined, so either
$File::Find::name or $_ can be used to access the filesystem
entry being examined. However, if you want to use the names
afterward, you should always collect the $File::Find::name values,
because you'll no longer be in the proper directory for $_.
In this case, our wanted routine returns early if the entry for
$_ is not a file (the -f test conveniently defaulting to $_).
If we don't return, then the full pathname is printed, followed by a
newline. The result will be a series of all the names of all the files
within the current directory (and below) printed to standard output.
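The earlier advice about collecting $File::Find::name can be sketched with core modules only. This is a minimal illustration, assuming a Unix-like system; the temporary tree and the names $top and @files are made up for the example:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a small throwaway tree so the sketch has something to walk.
my $top = tempdir(CLEANUP => 1);
mkdir "$top/sub" or die "mkdir: $!";
for my $path ("$top/a.txt", "$top/sub/b.txt") {
    open my $fh, '>', $path or die "open $path: $!";
    close $fh or die "close $path: $!";
}

my @files;
find(sub {
    return unless -f;                  # $_ holds the basename here
    push @files, $File::Find::name;    # keep the full pathname for later
}, $top);

# find has returned, so we are back in the original working directory
# and only the full pathnames are still usable.
print "$_\n" for sort @files;
```

Had the routine pushed $_ instead, the list would hold bare basenames that no longer resolve once find has returned.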
The equivalent Unix find command line to perform this function is:
find . -type f -print
Here, the starting points are listed first, followed by a series of one or more conditions, followed (usually) by some action to perform. In this case, again, starting at ``dot'', we'll recurse, looking for all files and printing the ones that we find.
Some time ago, I wrote File::Finder to be able to translate find
commands rather directly into Perl code that would then use
File::Find behind the scenes to do the work. The equivalent code
using File::Finder (found in the CPAN) looks like:
use File::Finder;
File::Finder->type('f')->print->in('.');
Note that the type and print method calls correspond exactly to
the find arguments. Only the in call is out of order,
specifying a list of starting points after the conditions are
specified.
But how does this work? Calling type on the File::Finder
class returns a File::Finder object, similar to having said:
use File::Finder;
my $ff1 = File::Finder->new->type('f');
Inside the File::Finder object, the type method call has
recorded a step: a coderef that will ultimately check a pathname to
see if it is a file or not. The code to create this step is in the
File::Finder::Steps class, automatically selected by sneaky delegation
inside the File::Finder object.
Next, the File::Finder object is duplicated by the print method
call, adding a second step to ultimately print the pathname in question:
my $ff2 = $ff1->print;
The value of $ff1 is untouched. In fact, we can use it as the
starting point of another File::Finder rule.
At this point, $ff2 can be used as the wanted routine in File::Find
directly:
use File::Find; find $ff2, ".";
The File::Finder object recognizes that it is being used in a place
where a coderef is wanted, and turns itself into a wanted routine
that will execute the series of steps it contains. Thus, we get the
series of files printed on standard output as we want.
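To make the last few paragraphs concrete, here is a minimal sketch of how such an object could be built. It is not File::Finder's actual source: the MiniFinder package and its with_step method are hypothetical, and using the overload pragma's '&{}' hook is an assumption about one way to let an object act as a coderef.

```perl
use strict;
use warnings;

package MiniFinder;   # hypothetical name, not part of File::Finder

# Let an object stand in wherever a coderef is expected.
use overload '&{}' => sub { $_[0]->as_wanted }, fallback => 1;

sub new { my $class = shift; return bless { steps => [] }, $class }

# Return a copy with one more step; the invocant is left untouched.
sub with_step {
    my ($self, $step) = @_;
    $self = $self->new unless ref $self;   # allow class-method calls
    return bless { steps => [ @{ $self->{steps} }, $step ] }, ref $self;
}

# Build a closure that runs the steps in order and stops at the first
# false one.
sub as_wanted {
    my $self  = shift;
    my @steps = @{ $self->{steps} };
    return sub {
        for my $step (@steps) {
            return 0 unless $step->();
        }
        return 1;
    };
}

package main;

my @seen;
my $ff1 = MiniFinder->with_step(sub { length($_) > 3 });
my $ff2 = $ff1->with_step(sub { push @seen, $_; 1 });

# Calling the object like a coderef triggers the '&{}' overload.
for (qw(ab abcd abcde)) {
    $ff2->();
}
print "@seen\n";
```

Because with_step always copies the step list, $ff1 still holds a single step after $ff2 is built, which mirrors the "untouched" behavior described above.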
But, continuing on with the example, we can also call in on
this object:
$ff2->in('.');
This effectively does the same thing, loading File::Find to call
the find routine, passing the constructed wanted routine as
well. However, in has an additional feature: the matching names
are gathered and returned in a list context, or a count of the names
in a scalar context:
my @names = $ff2->in('.');
Of course, we've printed them all. If we didn't want them printed,
we could go back to the previous File::Finder object:
my @names = $ff1->in('.');
The in routine is actually a specialization of the gather routine,
which returns a list of the concatenated return values of the coderef
executed for each entry:
my %size_of = $ff1->gather(sub { $File::Find::name => -s }, '.');
Here, for each file, we'll execute the coderef, which returns a two-element list of the name and its corresponding size. When we concatenate the resulting lists, we get key/value pairs in the right shape to initialize the hash.
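The list-concatenation idea can be approximated with plain File::Find. In this sketch, gather_files is a hypothetical helper written for illustration (it is not a File::Finder function), and the temporary files exist only to give the example something to measure:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical helper: run $callback for every plain file and
# concatenate whatever lists it returns.
sub gather_files {
    my ($callback, @starts) = @_;
    my @results;
    find(sub {
        return unless -f;
        push @results, $callback->();   # list context keeps every value
    }, @starts);
    return @results;
}

my $top = tempdir(CLEANUP => 1);
for my $spec ([ 'two.txt', 'hi' ], [ 'five.txt', 'hello' ]) {
    my ($name, $content) = @$spec;
    open my $fh, '>', "$top/$name" or die "open: $!";
    print {$fh} $content;
    close $fh or die "close: $!";
}

# Each call returns (name, size); the flattened list initializes the hash.
my %size_of = gather_files(sub { $File::Find::name => -s }, $top);
printf "%s: %d bytes\n", $_, $size_of{$_} for sort keys %size_of;
```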
What if we wanted more conditions, like all files that start with a dot?
In find, we'd say this as:
find . -type f -name '.*' -print
And similarly, using File::Finder:
File::Finder->type('f')->name('.*')->print->in('.');
Again, the File::Finder representation is a straightforward
translation from the find command-line. Note that the name step
takes a string which is treated as a filename glob. If I pass a regexp
object instead, I get a regexp match:
File::Finder->type('f')->name(qr/^\./)->print->in('.');
How about the files that don't begin with a dot? While we could simply
change the glob to *, let's introduce a not instead:
find . -type f \! -name '.*' -print
And the equivalent File::Finder is similar again:
File::Finder->type('f')->not->name('.*')->print->in('.');
Note that not negates the test of the step that immediately
follows.
The default connection between type and name is and. We
can spell that out directly:
find . -type f -a -name '.*' -print
The and here is a short-circuit and, meaning that if the left
side of the and is false, the right side is ignored. The same rule
controls whether the print is executed, which we can see by
adding the second and:
find . -type f -a -name '.*' -a -print
We can write this expanded version in File::Finder as well:
File::Finder->type('f')->and->name('.*')->and->print->in('.');
In both cases, the and is merely a syntax helper, and does not
change the execution. The expression is computed from left to right,
and the first false step stops the execution, and thus keeps the
pathname from being printed.
We can introduce an or condition, which is also short-circuiting. These are typically used to say ``everything except'':
find . -type f -o -print
If the path is a file, the or stops, because a true value on the
left keeps the expression on the right from executing. So, we end up
with everything that isn't a file. In File::Finder, we still
have a direct correspondence:
File::Finder->type('f')->or->print->in('.');
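The same short-circuit logic can be written by hand in a File::Find wanted routine using Perl's own and and or operators. This is only a sketch for comparison: the temporary tree and the array names are invented, and the names are collected rather than printed so the two results can be inspected afterward.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

my $top = tempdir(CLEANUP => 1);
mkdir "$top/sub" or die "mkdir: $!";
for my $path ("$top/.hidden", "$top/visible", "$top/sub/.also_hidden") {
    open my $fh, '>', $path or die "open $path: $!";
    close $fh or die "close $path: $!";
}

my (@dot_files, @non_files);
find(sub {
    # Like: -type f -a -name '.*' -a -print
    # The push runs only if both tests to its left are true.
    -f and /^\./ and push @dot_files, $File::Find::name;

    # Like: -type f -o -print
    # The push runs only if the test to its left is false.
    -f or push @non_files, $File::Find::name;
}, $top);

print "dot files: @{[ sort @dot_files ]}\n";
print "non-files: @{[ sort @non_files ]}\n";
```

Here the non-files are the starting directory itself and the sub directory beneath it.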
What if we wanted to print all entries that are either a file or
beginning with a dot? Because of the relative precedence of and and or,
we need to use parentheses in the find command line:
find . '(' -type f -o -name '.*' ')' -print
To indicate parentheses in File::Finder, we add left and right:
File::Finder->left->type('f')->or->name('.*')->right->print->in('.');
Again, a direct correspondence with the find command.
The print operation returns a true value as well as printing the
name: useful to know if we chain any further steps after print.
The find command supports a prune option: if prune is
executed, and the entry is a directory, the directory is then
skipped, and not entered recursively. Let's say we're looking at
an SVN tree, and we don't want to descend into (or consider) any
.svn directories:
find . -type d -name '.svn' -prune -o -type f -print
If we're looking at a directory, and the directory is named .svn,
then we'll execute prune. This tells find to not descend into
this directory. Because prune always returns true, the or skips the
remaining evaluation. If the and-ed expression to the left of the
or is false, then we'll continue by requiring the path to be a
file, and if so, we'll print it. In File::Finder, again the
correspondence is straightforward:
my $prune_svn = File::Finder->type('d')->name('.svn')->prune;
$prune_svn->or->type('f')->print->in('.');
Note that we saved $prune_svn as a separate object. We can reuse
this to collect only directories:
my @dirs = $prune_svn->or->type('d')->in('.');
Being able to reuse these components allows building the condition in manageable pieces.
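For comparison, the same pruning can be done directly in File::Find by setting $File::Find::prune inside the wanted routine. The sketch below assumes a Unix-like system and builds a made-up tree containing two .svn directories:

```perl
use strict;
use warnings;
use File::Find;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

my $top = tempdir(CLEANUP => 1);
make_path("$top/.svn", "$top/src/.svn");
for my $path ("$top/.svn/entries", "$top/src/.svn/entries",
              "$top/src/main.pl", "$top/README") {
    open my $fh, '>', $path or die "open $path: $!";
    close $fh or die "close $path: $!";
}

my @files;
find(sub {
    if (-d and $_ eq '.svn') {
        $File::Find::prune = 1;   # skip this directory and all below it
        return;                   # nothing else to test for this entry
    }
    push @files, $File::Find::name if -f;
}, $top);

print "$_\n" for sort @files;
```

Neither entries file is reported, because find never descends into the pruned directories.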
We can also evaluate arbitrary Perl code at a particular step. The
code is executed as part of a File::Find ``wanted'' operation, so it
gets all the same treatment. If the code returns true, then the step
is also considered true. For example, suppose we want to make sure
that symlinks point at a valid file entry. We can add a step made
with eval to check -l and not -e for dangling symlinks:
my @danglers = $prune_svn->or->eval(sub { -l and not -e })->in('.');
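The -l and not -e test can be tried on its own with File::Find. This sketch assumes a platform where symlink is available; the link names are invented for the example:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

my $top = tempdir(CLEANUP => 1);
open my $fh, '>', "$top/target" or die "open: $!";
close $fh or die "close: $!";

# One link that resolves and one that points at nothing.
symlink "$top/target",  "$top/good"     or die "symlink: $!";
symlink "$top/missing", "$top/dangling" or die "symlink: $!";

my @danglers;
find(sub {
    # -l examines the link itself; -e follows it to the target.
    push @danglers, $File::Find::name if -l and not -e;
}, $top);

print "$_\n" for @danglers;
```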
The eval step also accepts File::Finder objects, allowing us
to create subroutines:
my $file = File::Finder->type('f');
my $begins_with_dot = File::Finder->name('.*');
my $file_or_begins_with_dot = File::Finder
    ->eval($file)->or->eval($begins_with_dot);
my @dotfiles = $prune_svn->or->eval($file_or_begins_with_dot)->in('.');
This is an alternative to using parentheses to achieve the same
result, because I can consider the eval subcomponent to be
parenthesized.
Although File::Finder operates similarly to the older
File::Find::Rule, I personally find that the syntax of
File::Finder is more natural. I might explain this as having spent
years writing find commands, dealing with the slightly weird
and/or/not/paren syntax for complex rules.
However, File::Find::Rule supports conditions that File::Finder
doesn't understand (yet!). So, to allow me to leverage the existing
File::Find::Rule conditions and plugins, I can use an ffr step
with a File::Find::Rule object, and the appropriate condition
is interpreted. For example, to find images that have greater than
1000 pixels in both directions, I would create the File::Find::Rule
object first:
use File::Find::Rule; use File::Find::Rule::ImageSize;
my $ffr_big_images = File::Find::Rule->image_x('>1000')->image_y('>1000');
And now I can use this FFR step with File::Finder:
use File::Finder;
my $big_images = File::Finder->ffr($ffr_big_images);
my %sizes = $big_images->gather(sub { $File::Find::name => -s }, 'Pictures');
I hope you find that File::Finder finds its way into your toolkit.
Until next time, enjoy!

