Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Linux Magazine Column 73 (Aug 2005)
[Suggested title: ``File::Finder: finding files easier'']
Sooner or later, every Perl hacker ends up wanting to process a collection of files contained within a directory, including all the files in all the subdirectories. Thankfully, Perl comes with the File::Find module to perform this task in a tested, portable manner. File::Find's basic interface is simple:
use File::Find;
sub wanted {
  return unless -f;
  print $File::Find::name, "\n";
}
find \&wanted, ".";
Here, File::Find exports the find routine, which takes a coderef and a list of starting points (here, ``dot'' for the current directory). The mechanism inside File::Find locates all filesystem entries below (and including) the list of starting points, and calls the subroutine referenced by the coderef for each entry. Note that File::Find doesn't filter anything. It's up to the subroutine to ignore the entries that are not of interest.
The wanted routine gets the full pathname in $File::Find::name, and the basename in $_. For efficiency, the current process is also chdir'ed to the directory being examined, so $_ can always be used to access the filesystem entry being examined ($File::Find::name works for that as well when the starting point is an absolute path). However, if you want to use the names afterward, you should always collect the $File::Find::name values, because you'll no longer be in the proper directory for $_.
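To make that last point concrete, here is a minimal sketch that collects the full names during the traversal and uses them after find has returned. The make_tree and files_below helpers and the file names are made up for illustration, and the starting point is a temporary directory (normally an absolute path), so the collected names stay usable afterward:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a small throwaway tree so the example is self-contained.
# (The directory and file names are hypothetical.)
sub make_tree {
    my $top = tempdir(CLEANUP => 1);
    mkdir "$top/sub" or die "mkdir: $!";
    for my $file ("$top/a.txt", "$top/sub/b.txt") {
        open my $fh, '>', $file or die "open $file: $!";
        print {$fh} "hello\n";
        close $fh;
    }
    return $top;
}

# Collect the full pathname of every plain file at or below $top.
sub files_below {
    my ($top) = @_;
    my @found;
    find(sub {
        return unless -f;                  # $_ is the basename here
        push @found, $File::Find::name;    # the name that stays valid later
    }, $top);
    my @sorted = sort @found;
    return @sorted;
}

my $top = make_tree();
for my $path (files_below($top)) {
    # We are no longer chdir'ed into the subdirectories,
    # but the collected names still reach the files.
    printf "%s (%d bytes)\n", $path, -s $path;
}
```

Had we pushed $_ instead, the loop at the end would be looking for a.txt and b.txt relative to wherever the program started, and would not find them.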
In this case, our wanted routine returns if the entry for $_ is not a file (the -f test conveniently defaulting to $_). If we don't return, then the full pathname is printed, followed by a newline. The result is the names of all the files within the current directory (and below), printed to standard output.
The equivalent Unix find command line to perform this function is:
find . -type f -print
Here, the starting points are listed first, followed by a series of one or more conditions, followed (usually) by some action to perform. In this case, again, starting at ``dot'', we'll recurse, looking for all files, printing the ones that we found.
Some time ago, I wrote File::Finder to be able to translate find commands rather directly into Perl code that would then use File::Find behind the scenes to do the work. The equivalent code using File::Finder (found in the CPAN) looks like:
use File::Finder;
File::Finder->type('f')->print->in('.');
Note that the type and print method calls correspond exactly to the find arguments. Only the in call is out of order, specifying a list of starting points after the conditions are specified.
But how is this working? Calling type on the File::Finder class returns a File::Finder object, similar to having said:
use File::Finder;
my $ff1 = File::Finder->new->type('f');
Inside the File::Finder object, the type method call has recorded a step: a coderef that will ultimately check a pathname to see if it is a file or not. The code to create this step is in the File::Finder::Steps class, automatically selected by sneaky delegation inside the File::Finder object.
Next, the File::Finder object is duplicated by the print method call, adding a second step to ultimately print the pathname in question:
my $ff2 = $ff1->print;
The value of $ff1 is untouched. In fact, we can use it as the starting point of another File::Finder rule.
At this point, $ff2 can be used as the wanted routine in File::Find directly:
use File::Find;
find $ff2, ".";
The File::Finder object recognizes that it is being used in a place where a coderef is wanted, and turns itself into a wanted routine that will execute the series of steps it contains. Thus, we get the series of files printed on standard output as we want.
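The general trick of an object that duplicates itself on each method call and behaves like a coderef on demand can be sketched in a few lines with Perl's overload pragma. This is only an illustration under assumed names (TinyFinder, add_step, and as_wanted are hypothetical), not File::Finder's actual implementation, which may well differ:

```perl
use strict;
use warnings;

package TinyFinder;    # hypothetical class, not part of File::Finder

# Using the object where a coderef is expected calls as_wanted.
use overload '&{}' => sub { $_[0]->as_wanted };

sub new {
    my ($class) = @_;
    return bless { steps => [] }, $class;
}

# Return a copy with one more step; the original object is untouched.
sub add_step {
    my ($self, $code) = @_;
    return bless { steps => [ @{ $self->{steps} }, $code ] }, ref $self;
}

# Build a closure that runs the steps in order,
# stopping at the first step that returns false.
sub as_wanted {
    my ($self) = @_;
    my @steps = @{ $self->{steps} };
    return sub {
        for my $step (@steps) {
            return 0 unless $step->();
        }
        return 1;
    };
}

package main;

my $ff1 = TinyFinder->new->add_step(sub { length $_ });
my $ff2 = $ff1->add_step(sub { /^\./ });    # $ff1 still has one step

for my $name ('.profile', 'notes.txt') {
    local $_ = $name;
    # Calling the objects as if they were coderefs triggers the overload.
    printf "%-10s ff1=%d ff2=%d\n", $name, $ff1->(), $ff2->();
}
```

Because add_step always blesses a fresh hash, an earlier object such as $ff1 can safely be reused as the base of several different rules.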
But, continuing on with the example, we can also call in on this object:
$ff2->in('.');
This effectively does the same thing, loading File::Find to call the find routine, passing the constructed wanted routine as well. However, in has an additional feature: the matching names are gathered and returned in a list context, or a count of the names in a scalar context:
my @names = $ff2->in('.');
Of course, we've printed them all. If we didn't want them printed, we could go back to the previous File::Finder object:
my @names = $ff1->in('.');
The in routine is actually a specialization of the gather routine, which returns a list of the concatenated return values of the coderef executed for each entry:
my %size_of = $ff1->gather(sub { $File::Find::name => -s }, '.');
Here, for each file, we'll execute the coderef, which returns a two-element list of the name and its corresponding size. When we concatenate the resulting lists, we get key/value pairs in the right shape to initialize the hash.
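The same concatenate-then-assign idea can be tried with nothing but File::Find. In this sketch, gather_files is a hypothetical stand-in that I've hard-wired to plain files; it is not a File::Finder function, and the file names are invented:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Call $code for each plain file below the starting points and
# concatenate whatever lists it returns.
# (gather_files is a hypothetical helper for illustration only.)
sub gather_files {
    my ($code, @starts) = @_;
    my @results;
    find(sub {
        return unless -f;
        push @results, $code->();    # list context: keeps every element
    }, @starts);
    return @results;
}

# Two throwaway files of known size.
my $dir = tempdir(CLEANUP => 1);
for my $spec (['one', 'x'], ['two', 'xxxx']) {
    my ($name, $content) = @$spec;
    open my $fh, '>', "$dir/$name" or die "open: $!";
    print {$fh} $content;
    close $fh;
}

# Each call returns (name, size); the flattened list initializes the hash.
my %size_of = gather_files(sub { ($File::Find::name => -s $_) }, $dir);
printf "%6d  %s\n", $size_of{$_}, $_ for sort keys %size_of;
```

Returning an empty list from the coderef would contribute nothing for that entry, which is a handy way to skip entries inside the gathering code itself.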
What if we wanted more conditions, like all files that start with a dot? In find, we'd say this as:
find . -type f -name '.*' -print
And similarly, using File::Finder:
File::Finder->type('f')->name('.*')->print->in('.');
Again, the File::Finder representation is a straightforward translation from the find command line. Note that the name step takes a string which is treated as a filename glob. If I pass a regexp object instead, I get a regexp match:
File::Finder->type('f')->name(qr/^\./)->print->in('.');
How about the files that don't begin with a dot? While we could simply change the glob to *, let's introduce a not instead:
find . -type f \! -name '.*' -print
And the equivalent File::Finder is similar again:
File::Finder->type('f')->not->name('.*')->print->in('.');
Note that not negates the test of the step that immediately follows.
The default connection between type and name is and. We can spell that out directly:
find . -type f -a -name '.*' -print
The and here is a short-circuit and, meaning that if the left side of the and is false, the right side is ignored. This also controls whether the print is executed, which we can see by adding the second and:
find . -type f -a -name '.*' -a -print
We can write this expanded version in File::Finder as well:
File::Finder->type('f')->and->name('.*')->and->print->in('.');
In both cases, the and is merely a syntax helper, and does not change the execution. The expression is computed from left to right, and the first false step stops the execution, and thus keeps the pathname from being printed.
We can introduce an or condition, which is also short circuiting. These are typically used to say ``everything except'':
find . -type f -o -print
If the path is a file, the or stops, because a true value on the left keeps the expression on the right from executing. So, we end up with everything that isn't a file. In File::Finder, we still have a direct correspondence:
File::Finder->type('f')->or->print->in('.');
What if we wanted to print all entries that are either a file or beginning with a dot? Because of the relative precedence of and and or, we need to use parentheses in the find command line:
find . '(' -type f -o -name '.*' ')' -print
To indicate parentheses in File::Finder, we add left and right:
File::Finder->left->type('f')->or->name('.*')->right->print->in('.');
Again, a direct correspondence with the find command.
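Perl's own low-precedence and and or follow the same relative precedence, so the effect of the parentheses can be seen in a hand-written wanted routine. This is a small sketch using only File::Find; the tree and its file names are invented for the demonstration:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Throwaway tree: a plain file, a dot-file, and a dot-directory.
my $top = tempdir(CLEANUP => 1);
mkdir "$top/.dotdir" or die "mkdir: $!";
for my $file ("$top/plain.txt", "$top/.hidden") {
    open my $fh, '>', $file or die "open $file: $!";
    close $fh;
}

my (@grouped, @ungrouped);
find(sub {
    return if $_ eq '.';    # skip the starting directory itself

    # Like: '(' -type f -o -name '.*' ')' -print
    (-f or /^\./) and push @grouped, $_;

    # Like: -type f -o -name '.*' -print   (and binds tighter than or)
    -f or /^\./ and push @ungrouped, $_;
}, $top);

print "grouped:   @{[ sort @grouped ]}\n";
print "ungrouped: @{[ sort @ungrouped ]}\n";
```

Without the parentheses, a true file test short-circuits the whole expression, so only the non-file entries whose names begin with a dot are collected.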
The print operation returns a true value as well as printing the name: useful to know if we chain any further steps after print.
The find command supports a prune option: if prune is executed, and the entry is a directory, the directory is then skipped, and not entered recursively. Let's say we're looking at an SVN tree, and we don't want to descend into (or consider) any .svn directories:
find . -type d -name '.svn' -prune -o -type f -print
If we're looking at a directory, and the directory is named .svn, then we'll execute prune. This tells find to not descend into this directory. Because prune also returns true, the or skips the remaining evaluation. If the and-ed expression to the left of the or is false, then we'll continue by requiring the path to be a file, and if so, we'll print it. In File::Finder, again the correspondence is straightforward:
my $prune_svn = File::Finder->type('d')->name('.svn')->prune;
$prune_svn->or->type('f')->print->in('.');
Note that we saved $prune_svn as a separate object. We can reuse this to collect only directories:
my @dirs = $prune_svn->or->type('d')->in('.');
Being able to reuse these components allows building the condition in manageable pieces.
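For comparison, pruning in a hand-written wanted routine is done by setting $File::Find::prune. Here is a minimal sketch of the .svn case using only File::Find, with an invented two-file tree standing in for a real working copy:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Throwaway tree: one ordinary file plus a file hidden inside .svn
my $top = tempdir(CLEANUP => 1);
mkdir "$top/.svn" or die "mkdir: $!";
for my $file ("$top/keep.txt", "$top/.svn/entries") {
    open my $fh, '>', $file or die "open $file: $!";
    close $fh;
}

my @files;
find(sub {
    if (-d and $_ eq '.svn') {
        $File::Find::prune = 1;    # do not descend into this directory
        return;                    # and do not consider it any further
    }
    push @files, $File::Find::name if -f;
}, $top);

print "$_\n" for @files;
```

The early return plays the role of the or in the find expression: once the prune branch has fired, nothing else is evaluated for that entry.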
We can also evaluate arbitrary Perl code at a particular step. The code is executed as part of a File::Find ``wanted'' operation, so it gets all the same treatment. If the code returns true, then the step is also considered true. For example, suppose we want to make sure that symlinks point at a valid file entry. We can add a step made with eval to check -l and not -e for dangling symlinks:
my @danglers = $prune_svn->or->eval(sub { -l and not -e })->in('.');
The eval step also accepts File::Finder objects, allowing us to create subroutines:
my $file = File::Finder->type('f');
my $begins_with_dot = File::Finder->name('.*');
my $file_or_begins_with_dot = File::Finder
  ->eval($file)->or->eval($begins_with_dot);
my @dotfiles = $prune_svn->or->eval($file_or_begins_with_dot)->in('.');
This is an alternative to using parentheses to achieve the same result, because I can consider the eval subcomponent to be parenthesized.
Although File::Finder operates similarly to the older File::Find::Rule, I personally find that the syntax of File::Finder is more natural. I might explain this as having spent years writing find commands, dealing with the slightly weird and/or/not/paren syntax for complex rules.
However, File::Find::Rule supports conditions that File::Finder doesn't understand (yet!). So, to allow me to leverage the existing File::Find::Rule conditions and plugins, I can use an ffr step with a File::Find::Rule object, and the appropriate condition is interpreted. For example, to find images that have greater than 1000 pixels in both directions, I would create the File::Find::Rule object first:
use File::Find::Rule;
use File::Find::Rule::ImageSize;
my $ffr_big_images = File::Find::Rule->image_x('>1000')->image_y('>1000');
And now I can use this FFR step with File::Finder:
use File::Finder;
my $big_images = File::Finder->ffr($ffr_big_images);
my %sizes = $big_images->gather(sub { $File::Find::name => -s }, 'Pictures');
I hope you find that File::Finder finds its way into your toolkit.
Until next time, enjoy!