Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Linux Magazine Column 73 (Aug 2005)

[Suggested title: ``File::Finder: finding files easier'']

Sooner or later, every Perl hacker ends up wanting to process a collection of files contained within a directory, including all the files in all the subdirectories. Thankfully, Perl comes with the File::Find module to perform this task in a tested, portable manner. File::Find's basic interface is simple:

  use File::Find;
  sub wanted {
    return unless -f;
    print $File::Find::name, "\n";
  }
  find \&wanted, ".";

Here, File::Find exports the find routine, which takes a coderef and a list of starting points (here, ``dot'' for the current directory). The mechanism inside File::Find locates all filesystem entries below (and including) the list of starting points, and calls the subroutine referenced by the coderef for each entry. Note that File::Find doesn't filter anything. It's up to the subroutine to ignore the entries that are not of interest.

The wanted routine gets the full pathname in $File::Find::name, and the basename in $_. For efficiency, the current process is also chdir'ed to the directory being examined, so $_ can be used directly to access the filesystem entry being examined ($File::Find::name works for that only when the starting points are absolute paths, because it is built from the starting point as given). However, if you want to use the names afterward, you should always collect the $File::Find::name values, because you'll no longer be in the proper directory for $_.
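To make the difference between the two variables concrete, here is a minimal sketch that uses only core modules. It builds a throwaway tree with File::Temp (the file names a.txt and sub/b.txt are made up for the illustration) and collects both values for every file:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);
use File::Spec;

# Build a small throwaway tree so the example is self-contained.
my $top = tempdir(CLEANUP => 1);
mkdir File::Spec->catdir($top, 'sub') or die "mkdir: $!";
for my $rel ('a.txt', File::Spec->catfile('sub', 'b.txt')) {
    my $path = File::Spec->catfile($top, $rel);
    open my $fh, '>', $path or die "open $path: $!";
    print {$fh} "hello\n";
    close $fh or die "close $path: $!";
}

my (@full_names, @base_names);
find(sub {
    return unless -f $_;                   # $_ is the basename, tested in the current directory
    push @full_names, $File::Find::name;   # still usable after find() returns
    push @base_names, $_;                  # only meaningful while chdir'ed into the directory
}, $top);

print "$_\n" for sort @full_names;
```

Once find returns, the names in @full_names can still be opened or tested, while the bare basenames in @base_names no longer say which directory they came from.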

In this case, our wanted routine returns early if the entry for $_ is not a file (the -f test conveniently defaulting to $_). If we don't return, then the full pathname is printed, followed by a newline. The result will be a series of all the names of all the files within the current directory (and below) printed to standard output. The equivalent Unix find command line to perform this function is:

  find . -type f -print

Here, the starting points are listed first, followed by a series of one or more conditions, followed (usually) by some action to perform. In this case, again, starting at ``dot'', we'll recurse, looking for all files, printing the ones that we found.

Some time ago, I wrote File::Finder to be able to translate find commands rather directly into Perl code that would then use File::Find behind the scenes to do the work. The equivalent code using File::Finder (found in the CPAN) looks like:

  use File::Finder;
  File::Finder->type('f')->print->in('.');

Note that the type and print method calls correspond exactly to the find arguments. Only the in call is out of order, specifying a list of starting points after the conditions are specified.

But how does this work? Calling type on the File::Finder class returns a File::Finder object, similar to having said:

  use File::Finder;
  my $ff1 = File::Finder->new->type('f');

Inside the File::Finder object, the type method call has recorded a step: a coderef that will ultimately check a pathname to see if it is a file or not. The code to create this step is in the File::Finder::Steps class, automatically selected by sneaky delegation inside the File::Finder object.

Next, the File::Finder object is duplicated by the print method call, adding a second step to ultimately print the pathname in question:

  my $ff2 = $ff1->print;

The value of $ff1 is untouched. In fact, we can use it as the starting point of another File::Finder rule.

At this point, $ff2 can be used as the wanted routine in File::Find directly:

  use File::Find;
  find $ff2, ".";

The File::Finder object recognizes that it is being used in a place where a coderef is wanted, and turns itself into a wanted routine that will execute the series of steps it contains. Thus, we get the series of files printed on standard output as we want.
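One plausible way for an object to stand in for a coderef is Perl's overload pragma with the '&{}' key. The sketch below is not File::Finder's source, and its internals may well differ. StepList, new, and add are hypothetical names invented for the illustration. It also shows the copy-on-add behavior described above, where adding a step leaves the original object untouched:

```perl
use strict;
use warnings;

package StepList;   # hypothetical class, for illustration only

# When the object is used as a coderef, build a sub that runs every step
# in order and stops at the first one that returns false.
use overload '&{}' => sub {
    my $self  = shift;
    my @steps = @{ $self->{steps} };
    return sub {
        for my $step (@steps) {
            return 0 unless $step->(@_);
        }
        return 1;
    };
};

sub new {
    my $class = shift;
    return bless { steps => [] }, $class;
}

# Adding a step returns a new object; the original keeps its own step list.
sub add {
    my ($self, $step) = @_;
    return bless { steps => [ @{ $self->{steps} }, $step ] }, ref $self;
}

package main;

my $even     = StepList->new->add(sub { $_[0] % 2 == 0 });
my $even_big = $even->add(sub { $_[0] > 10 });

my @picked = grep { $even_big->($_) } 1 .. 20;   # object called like a coderef
my @small  = grep { $even->($_) } 1 .. 6;        # $even still has only one step

print "@picked\n";
print "@small\n";
```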

But, continuing on with the example, we can also call in on this object:

  $ff2->in('.');

This effectively does the same thing, loading File::Find to call the find routine, passing the constructed wanted routine as well. However, in has an additional feature: the matching names are gathered and returned in a list context, or a count of the names in a scalar context:

  my @names = $ff2->in('.');

Of course, we've printed them all. If we didn't want them printed, we could go back to the previous File::Finder object:

  my @names = $ff1->in('.');

The in routine is actually a specialization of the gather routine, which returns a list of the concatenated return values of the coderef executed for each entry:

  my %size_of = $ff1->gather(sub { $File::Find::name => -s }, '.');

Here, for each file, we'll execute the coderef, which returns a two-element list of the name and its corresponding size. When we concatenate the resulting lists, we get key/value pairs in the right shape to initialize the hash.
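The same gathering idea can be sketched with nothing but core File::Find. The helper gather_sketch below is hypothetical and is not how File::Finder implements gather. It simply calls the supplied coderef in list context for each file and concatenates whatever comes back. The file names and sizes are invented for the example:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical helper: run $code for every plain file below @starts
# and return the concatenation of the lists it returns.
sub gather_sketch {
    my ($code, @starts) = @_;
    my @results;
    find(sub {
        return unless -f $_;         # same filter as type('f')
        push @results, $code->();    # list context keeps every returned element
    }, @starts);
    return @results;
}

# Create two files of known size in a throwaway directory.
my $top = tempdir(CLEANUP => 1);
my %bytes_for = (small => 3, large => 10);
for my $name (keys %bytes_for) {
    open my $fh, '>', "$top/$name" or die "open $top/$name: $!";
    print {$fh} 'x' x $bytes_for{$name};
    close $fh or die "close $top/$name: $!";
}

# Each call returns (name, size), so the flat list initializes a hash.
my %size_of = gather_sketch(sub { $File::Find::name => -s $_ }, $top);

printf "%s: %d bytes\n", $_, $size_of{$_} for sort keys %size_of;
```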

What if we wanted more conditions, like all files that start with a dot? In find, we'd say this as:

  find . -type f -name '.*' -print

And similarly, using File::Finder:

  File::Finder->type('f')->name('.*')->print->in('.');

Again, the File::Finder representation is a straightforward translation from the find command-line. Note that the name step takes a string which is treated as a filename glob. If I pass a regexp object instead, I get a regexp match:

  File::Finder->type('f')->name(qr/^\./)->print->in('.');

How about the files that don't begin with a dot? While we could simply change the glob to *, let's introduce a not instead:

  find . -type f \! -name '.*' -print

And the equivalent File::Finder is similar again:

  File::Finder->type('f')->not->name('.*')->print->in('.');

Note that not negates the test of the step that immediately follows.

The default connection between type and name is and. We can spell that out directly:

  find . -type f -a -name '.*' -print

The and here is a short-circuit and, meaning that if the left side of the and is false, the right side is ignored. This also controls whether the print is executed, which we can see by adding the second and:

  find . -type f -a -name '.*' -a -print

We can write this expanded version in File::Finder as well:

  File::Finder->type('f')->and->name('.*')->and->print->in('.');

In both cases, the and is merely a syntax helper, and does not change the execution. The expression is computed from left to right, and the first false step stops the execution, and thus keeps the pathname from being printed.
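The left-to-right, stop-at-first-false evaluation can be sketched in a few lines of plain Perl. The names run_steps, $is_dotfile, $not_backup, and $record are hypothetical and exist only for this illustration. The steps take a name as an argument rather than consulting the filesystem, so the sketch runs anywhere:

```perl
use strict;
use warnings;

# Hypothetical evaluator: each step is a coderef taking a name.
# Evaluation stops at the first step that returns false (the implied "and").
sub run_steps {
    my ($name, @steps) = @_;
    for my $step (@steps) {
        return 0 unless $step->($name);
    }
    return 1;
}

my @printed;
my $is_dotfile = sub { $_[0] =~ /^\./ };
my $not_backup = sub { $_[0] !~ /~\z/ };
my $record     = sub { push @printed, $_[0]; 1 };   # like print: acts, then returns true

run_steps($_, $is_dotfile, $not_backup, $record)
    for qw(.bashrc notes.txt .vimrc~ .profile);

print "$_\n" for @printed;
```

Only .bashrc and .profile reach the final step. The name notes.txt fails the first test, and .vimrc~ fails the second, so the recording step is never run for either.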

We can introduce an or condition, which is also short circuiting. These are typically used to say ``everything except'':

  find . -type f -o -print

If the path is a file, the or stops, because a true value on the left keeps the expression on the right from executing. So, we end up with everything that isn't a file. In File::Finder, we still have a direct correspondence:

  File::Finder->type('f')->or->print->in('.');

What if we wanted to print all entries that are either a file or beginning with a dot? Because of the relative precedence of and and or, we need to use parentheses in the find command line:

  find . '(' -type f -o -name '.*' ')' -print

To indicate parentheses in File::Finder, we add left and right:

  File::Finder->left->type('f')->or->name('.*')->right->print->in('.');

Again, a direct correspondence with the find command.

The print operation returns a true value as well as printing the name: useful to know if we chain any further steps after print.

The find command supports a prune option: if prune is executed, and the entry is a directory, the directory is then skipped, and not entered recursively. Let's say we're looking at an SVN tree, and we don't want to descend into (or consider) any .svn directories:

  find . -type d -name '.svn' -prune -o -type f -print

If we're looking at a directory, and the directory is named .svn, then we'll execute prune. This tells find to not descend into this directory. If that also returns true, the or skips the remaining evaluation. If the and-ed expression to the left of the or is false, then we'll continue by requiring the path to be a file, and if so, we'll print it. In File::Finder, again the correspondence is straightforward:

  my $prune_svn = File::Finder->type('d')->name('.svn')->prune;
  $prune_svn->or->type('f')->print->in('.');

Note that we saved $prune_svn as a separate object. We can reuse this to collect only directories:

  my @dirs = $prune_svn->or->type('d')->in('.');

Being able to reuse these components allows building the condition in manageable pieces.
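For comparison, here is a sketch of the same prune-then-print logic written directly against core File::Find, which exposes pruning through the $File::Find::prune variable. The directory layout (.svn/entries, src/main.pl, README) is invented for the example:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a throwaway tree containing a .svn directory we want to skip.
my $top = tempdir(CLEANUP => 1);
for my $dir (qw(.svn src)) {
    mkdir "$top/$dir" or die "mkdir $top/$dir: $!";
}
for my $rel (qw(.svn/entries src/main.pl README)) {
    open my $fh, '>', "$top/$rel" or die "open $top/$rel: $!";
    close $fh or die "close $top/$rel: $!";
}

my @files;
find(sub {
    if ($_ eq '.svn' and -d $_) {
        $File::Find::prune = 1;   # do not descend into this directory
        return;                   # left side was true, so skip the "or" branch
    }
    push @files, $File::Find::name if -f $_;
}, $top);

# Strip the temporary prefix so the result is easy to read.
my @rel = sort map { substr $_, length($top) + 1 } @files;
print "$_\n" for @rel;
```

The file .svn/entries never shows up, because find never enters the pruned directory.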

We can also evaluate arbitrary Perl code at a particular step. The code is executed as part of a File::Find ``wanted'' operation, so it gets all the same treatment. If the code returns true, then the step is also considered true. For example, suppose we want to make sure that symlinks point at a valid file entry. We can add a step made with eval to check -l and not -e for dangling symlinks:

  my @danglers = $prune_svn->or->eval(sub { -l and not -e })->in('.');

The eval step also accepts File::Finder objects, allowing us to create subroutines:

  my $file = File::Finder->type('f');
  my $begins_with_dot = File::Finder->name('.*');

  my $file_or_begins_with_dot = File::Finder
    ->eval($file)->or->eval($begins_with_dot);
  my @dotfiles = $prune_svn->or->eval($file_or_begins_with_dot)->in('.');

This is an alternative to using parentheses to achieve the same result, because I can consider the eval subcomponent to be parenthesized.

Although File::Finder operates similarly to the older File::Find::Rule, I personally find that the syntax of File::Finder is more natural. I might explain this as having spent years writing find commands, dealing with the slightly weird and/or/not/paren syntax for complex rules.

However, File::Find::Rule supports conditions that File::Finder doesn't understand (yet!). So, to allow me to leverage the existing File::Find::Rule conditions and plugins, I can use an ffr step with a File::Find::Rule object, and the appropriate condition is interpreted. For example, to find images that have greater than 1000 pixels in both directions, I would create the File::Find::Rule object first:

  use File::Find::Rule;
  use File::Find::Rule::ImageSize;

  my $ffr_big_images = File::Find::Rule->image_x('>1000')->image_y('>1000');

And now I can use this FFR step with File::Finder:

  use File::Finder;
  my $big_images = File::Finder->ffr($ffr_big_images);
  my %sizes = $big_images->gather(sub { $File::Find::name => -s }, 'Pictures');

I hope you find that File::Finder finds its way into your toolkit. Until next time, enjoy!

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.
