Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Linux Magazine Column 45 (Feb 2003)
[suggested title: Finding things]
``Where do we start?''
This is the phrase I often utter as I'm beginning a new magazine article, a response to a PerlMonks or Usenet posting, or even a new book. It's also a phrase I use when writing a new program. Where is the data coming from? How do I find the data?
Often, the tasks I tackle with Perl involve taking data from one or more files, performing some sort of data reduction or reporting task against the data, then generating one or more files or actions. The files are often all within a single directory, but just as often, it seems the files are in a hierarchy of directories.
The canonical Unix command-line tool for dealing with files in a hierarchy of directories is the find command. For example, getting a listing of all files in /tmp that have not been accessed in the past 14 days is as simple as typing:

  find /tmp -type f -atime +14 -print
And finding all files or directories below the current directory is again as simple as:

  find . -print
When faced with similar tasks within a Perl program, you could simply call the find command. Or, you could stare at the documentation for the readdir operator for a while, and write your own recursive directory descent routine.

But it's usually simpler to use the File::Find module, included with the core of the Perl distribution for many recent releases. Let's see how this works, by starting with that simple find command.
To list the names of all files and directories below the current directory, use this code:
  use File::Find;
  find \&wanted, ".";
  sub wanted { print "$File::Find::name\n"; }
The first line brings in the File::Find module, defining the find routine. The second line invokes the find routine, passing it two parameters. The first parameter is a ``callback'': a reference to a subroutine that will be called for each name (such as a file or directory) found below the directory given as the second parameter. (You can include more than one starting point if you wish, but none of these examples uses that feature.)
For every entry below the starting directory (here ``.''), find will call wanted, passing it the full path in $File::Find::name, which we're printing. The current directory is set to the directory containing the name, and $_ is set to just the basename (the path without the directory part) of the file. This strategy permits maximum flexibility and speed, as we'll see in the later examples.
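To make the callback environment concrete, here's a minimal sketch. The temporary directory and the names subdir and file1 are invented for the demonstration; the point is only to show what $File::Find::name and $_ hold for each entry:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical layout for illustration: a throwaway directory
# holding subdir/file1.
my $top = tempdir(CLEANUP => 1);
mkdir "$top/subdir" or die "mkdir: $!";
open my $fh, ">", "$top/subdir/file1" or die "open: $!";
close $fh;

our @seen;
find sub {
    # $File::Find::name holds the full path; $_ holds just the
    # basename, relative to the (changed) current directory.
    push @seen, [ $File::Find::name, $_ ];
    print "full=$File::Find::name base=$_\n";
}, $top;
```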
The wanted subroutine is being used only by the find invocation. Rather than coming up with a subroutine name just to use it in one other place, we can use an anonymous subroutine instead, saving a bit of brainpower trying to come up with a name:
  use File::Find;
  find sub { print "$File::Find::name\n"; }, ".";
Because the filename is placed in $_, we can use file tests with the default argument to narrow our display. For example, suppose we wanted only the directories in our display:
  use File::Find;
  find sub {
    return unless -d;
    print "$File::Find::name\n";
  }, ".";
If the callback routine is handed a directory, the -d test is true, so the return exits the subroutine early. Note that the callback subroutine is always called for every entry, and it's up to the callback subroutine to reject the entries that do not meet the desired conditions.
The /tmp example earlier can also be processed in a similar way:
  use File::Find;
  find sub {
    return unless -f;
    return unless -A > 14;
    print "$File::Find::name\n";
  }, "/tmp";
The -A operator here returns the file's age in days since last access as a floating-point value, perfect for our test.
As the find routine recurses through directories, the callback routine for a given directory will be called before any of the directory contents are examined. So, if subdir contained file1 and file2, we'd get them in that order: subdir, subdir/file1, and subdir/file2. For some tasks, we need to see the directory name after the contents of the directory. For this, we replace find with finddepth. (This is similar to the -depth switch of the Unix find command.)
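The ordering difference is easy to see with a small sketch. The throwaway directory and the names subdir and file1 are invented for the demonstration:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical layout: a throwaway directory holding subdir/file1.
my $top = tempdir(CLEANUP => 1);
mkdir "$top/subdir" or die "mkdir: $!";
open my $fh, ">", "$top/subdir/file1" or die "open: $!";
close $fh;

our (@pre, @post);
find      sub { push @pre,  $File::Find::name }, $top;  # directory before contents
finddepth sub { push @post, $File::Find::name }, $top;  # contents before directory

print "find:      @pre\n";
print "finddepth: @post\n";
```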
One example of this is when you're renaming things. Let's say you're fixing up a hierarchy of Unix files so that it can be placed onto a Joliet filesystem on a CD. While Unix restricts only NUL and slash within a filename, the Windows operating system has a much narrower view of valid filenames and filename lengths. Let's do a cheap ``rename everything illegal to an underscore'' fixup, as follows:
  use File::Find;
  finddepth sub {
    (my $new = $_) =~ tr{\x00-\x1f\x80-\xFF*/:;?\\}{_};
    substr($new, 128) = "" if length $new > 128;
    if ($new ne $_) {         # needs renaming
      if (-e $new) {          # oops, already a file by that name!
        warn "Cannot rename $File::Find::name to $new: file exists!\n";
      } else {
        warn "renaming $File::Find::name to $new\n";
        rename $_, $new
          or warn "Cannot rename $File::Find::name to $new: $!\n";
      }
    }
  }, ".";
For every name below dot, we first compute $new, which is the name in $_ but with all illegal characters translated to underscores. The substr trims the name to 128 characters or less. If the new name is not the same as the original name, we'll check first to make sure we're not renaming over the top of an existing file, and then attempt to rename the file to the fixed name. We needed finddepth here, because if we had renamed the directory name before the contents, we wouldn't be able to find the contents any more!
The callback subroutine's return value is ignored. How do we accumulate any results then, like total blocks used? We simply let the callback subroutine see an outer lexical variable, modifying it as needed. For example, suppose we want total disk blocks used, broken down by owner ID. We'll define a %blocks hash keyed by user ID number, like so:
  use File::Find;
  my %blocks;
  find sub {
    return unless -f;
    return unless my @stat = stat;
    $blocks{$stat[4]} += $stat[12];
  }, ".";
At the end of this recursion, %blocks has the total blocks broken down by user, and a simple display loop shows the results:
  for (sort { $blocks{$b} <=> $blocks{$a} } keys %blocks) {
    printf "%16s %8d\n", scalar getpwuid($_), $blocks{$_};
  }
The recursion can be corralled by using the ``prune'' feature. If the variable $File::Find::prune is set to any true value during the callback routine when looking at a directory, that directory will not be examined further. (Of course, this works only when finddepth is not used, because by then it's too late.)
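Here's a minimal sketch of pruning in action; the throwaway directory and the names keep and skipme are invented for the demonstration:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical layout: keep/ and skipme/inside below a throwaway dir.
my $top = tempdir(CLEANUP => 1);
mkdir "$top/keep"   or die "mkdir: $!";
mkdir "$top/skipme" or die "mkdir: $!";
open my $fh, ">", "$top/skipme/inside" or die "open: $!";
close $fh;

our @seen;
find sub {
    # Setting $File::Find::prune stops any descent into this directory.
    return $File::Find::prune = 1 if $_ eq "skipme";
    push @seen, $_;
}, $top;

print "@seen\n";   # neither "skipme" nor "inside" appears
```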
For example, let's look at the count and size of all files in a CVS repository, organized by MIME type (text/plain, image/gif, and so on). We'll use File::MMagic to determine the MIME type, and we'll need to ignore the contents of any CVS directory:
  use File::MMagic;
  use File::Find;

  my $mm = File::MMagic->new;
  my %total;
  find sub {
    return $File::Find::prune = 1 if $_ eq "CVS";
    return if -d;
    my $type = $mm->checktype_filename($_);
    $total{$type}{count}++;
    $total{$type}{size} += (stat($_))[12];
    ## push @{$total{$type}{names}}, $File::Find::name;
  }, "/cvs/bigproject1";
At the beginning of the callback routine, if the name is exactly CVS, we'll set the prune variable to 1 and return from the subroutine. This prevents the processing not only of this entry, but also of any entry below it. Next, we'll figure out the MIME type, then compute a count and block-size summation. The commented line can be uncommented to track the exact filenames belonging to that MIME type.

Once the hash is created, we'll dump the data as follows:
  for (sort keys %total) {
    print "$_ has $total{$_}{count} items with $total{$_}{size} blocks\n";
    ## print map "  $_\n", sort @{$total{$_}{names}};
  }
If we added the filenames, we can uncomment the corresponding line in this loop to dump the filenames as well. This is useful if you get a MIME type of foo/bar, and you didn't think you had any foo-bar objects in your tree.
When I need to look at the contents of each file, and I don't need to decide pruning based on that, I find it's faster to push all the relevant filenames into @ARGV, and then use a <> loop to examine the contents. For example, suppose I want to dump out all the text files in the repository:
  use File::Find;

  @ARGV = ();
  find sub {
    return $File::Find::prune = 1 if $_ eq "CVS";
    return if -d;
    push @ARGV, $File::Find::name if -T;
  }, "/cvs/bigproject1";
Here, we'll start by clearing out the @ARGV array, then wandering down through the repository, ignoring any CVS directories and their contents, as well as any other directories. For the files that pass the -T test, we'll add the names to @ARGV. When this is done, it's as simple as:
  @ARGV = sort @ARGV;
  while (<>) {
    print "$ARGV\t$_";
  }
As each file is processed, the name of the file is placed into $ARGV, which I'm prefixing in front of each content line. What if I wanted line numbers? I just need to steal a bit more code from the perlfunc manpage, near the eof function description:
  @ARGV = sort @ARGV;
  while (<>) {
    print "$ARGV\t$.\t$_";
  } continue {
    close ARGV if eof;
  }
And now I get the filename, the line number, and the contents of the line for all the text files in my CVS tree.
Well, hopefully I've shown you some of the power of using File::Find. I've recently discovered in the CPAN a simple wrapper around this module called File::Find::Rule that makes it easier to specify some of the more common filters, but alas, I've run out of space in this article. Perhaps I'll cover that in a future article. Until then, enjoy!