Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Linux Magazine Column 45 (Feb 2003)
[suggested title: Finding things]
``Where do we start?''
This is the phrase I often utter as I'm beginning a new magazine article, a response to a PerlMonks or Usenet posting, or even a new book. It's also a phrase I use when writing a new program. Where is the data coming from? How do I find the data?
Often, the tasks I tackle with Perl involve taking data from one or more files, performing some sort of data reduction or reporting task against the data, then generating one or more files or actions. The files are often all within a single directory, but just as often, it seems the files are in a hierarchy of directories.
The canonical Unix command-line tool for dealing with files in a hierarchy of directories is the find command. For example, getting a listing of all files in /tmp that have not been accessed in the past 14 days is as simple as typing:

  find /tmp -type f -atime +14 -print
And finding all files or directories below the current directory is again as simple as:

  find . -print
When faced with similar tasks within a Perl program, you could simply call the find command. Or, you could stare at the documentation for the readdir operator for a while, and write your own recursive directory descent routine.

But it's usually simpler to use the File::Find module, included with the core of the Perl distribution for many recent releases. Let's see how this works, by starting with that simple find command.
To list the names of all files and directories below the current directory, use this code:
  use File::Find;
  find \&wanted, ".";
  sub wanted { print "$File::Find::name\n"; }
The first line brings in the File::Find module, defining the find routine. The second line invokes the find routine, passing it two parameters. The first parameter is a ``callback'': a reference to a subroutine that will be called for each name (such as a file or directory) found below the directory given as the second parameter. (You can include more than one starting point if you wish, but none of these examples uses that feature.)
For every entry below the starting directory (here ``.''), find will call wanted, passing it the full path in $File::Find::name, which we're printing. The current directory is set to the directory containing the name, and $_ is set to just the basename (the path without the directory part) of the file. This strategy permits maximum flexibility and speed, as we'll see in the later examples.
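To make the callback environment concrete, here's a minimal sketch. The temporary directory and the names subdir and file1 are invented for the demonstration; the point is only to show what $File::Find::name and $_ hold for each entry:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical layout for illustration: a throwaway directory
# holding subdir/file1.
my $top = tempdir(CLEANUP => 1);
mkdir "$top/subdir" or die "mkdir: $!";
open my $fh, ">", "$top/subdir/file1" or die "open: $!";
close $fh;

our @seen;
find sub {
    # $File::Find::name holds the full path; $_ holds just the
    # basename, relative to the (changed) current directory.
    push @seen, [ $File::Find::name, $_ ];
    print "full=$File::Find::name base=$_\n";
}, $top;
```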
The wanted subroutine is being used only by the find invocation. Rather than coming up with a subroutine name just to use it in one other place, we can use an anonymous subroutine instead, saving a bit of brainpower trying to come up with a name:
  use File::Find;
  find sub { print "$File::Find::name\n"; }, ".";
Because the filename is placed in $_, we can use file tests with the default argument to narrow our display. For example, suppose we wanted only the directories in our display:
  use File::Find;
  find sub {
    return unless -d;
    print "$File::Find::name\n";
  }, ".";
If the callback routine is handed a directory, the -d test is true, so the return exits the subroutine early. Note that the callback subroutine is always called for every entry, and it's up to the callback subroutine to reject the entries that do not meet the desired conditions.
The /tmp example earlier can also be processed in a similar way:
  use File::Find;
  find sub {
    return unless -f;
    return unless -A > 14;
    print "$File::Find::name\n";
  }, "/tmp";
The -A operator here returns the file's age in days since last access as a floating-point value, perfect for our test.
As the find routine recurses through directories, the callback routine for a given directory will be called before any of the directory contents are examined. So, if subdir contained file1 and file2, we'd get them in that order: subdir, subdir/file1, and subdir/file2. For some tasks, we need to see the directory name after the contents of the directory. For this, we replace find with finddepth. (This is similar to the -depth switch of the Unix find command.)
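The ordering difference is easy to see with a small sketch. The throwaway directory and the names subdir and file1 are invented for the demonstration:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical layout: a throwaway directory holding subdir/file1.
my $top = tempdir(CLEANUP => 1);
mkdir "$top/subdir" or die "mkdir: $!";
open my $fh, ">", "$top/subdir/file1" or die "open: $!";
close $fh;

our (@pre, @post);
find      sub { push @pre,  $File::Find::name }, $top;  # directory before contents
finddepth sub { push @post, $File::Find::name }, $top;  # contents before directory

print "find:      @pre\n";
print "finddepth: @post\n";
```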
One example of this is when you're renaming things. Let's say you're fixing up a hierarchy of Unix files so that it can be placed onto a Joliet filesystem on a CD. While Unix restricts only NUL and slash within a filename, the Windows operating system has a much narrower view of valid filenames and filename lengths. Let's do a cheap ``rename everything illegal to an underscore'' fixup, as follows:
  use File::Find;
  finddepth sub {
    (my $new = $_) =~ tr{\x00-\x1f\x80-\xFF*/:;?\\}{_};
    substr($new, 128) = "" if length $new > 128;
    if ($new ne $_) {         # needs renaming
      if (-e $new) {          # oops, already a file by that name!
        warn "Cannot rename $File::Find::name to $new: file exists!\n";
      } else {
        warn "renaming $File::Find::name to $new\n";
        rename $_, $new
          or warn "Cannot rename $File::Find::name to $new: $!\n";
      }
    }
  }, ".";
For every name below dot, we first compute $new, which is the name in $_ but with all illegal characters translated to underscores. The substr trims the name to 128 characters or less. If the new name is not the same as the original name, we'll check first to make sure we're not renaming over the top of an existing file, and then attempt to rename the file to the fixed name. We needed finddepth here, because if we had renamed the directory name before the contents, we wouldn't be able to find the contents any more!
The callback subroutine's return value is ignored. How do we accumulate any results then, like total blocks used? We simply let the callback subroutine see an outer lexical variable, modifying it as needed. For example, suppose we want total disk blocks used, broken down by owner ID. We'll define a %blocks hash keyed by user ID number, like so:
  use File::Find;
  my %blocks;
  find sub {
    return unless -f;
    return unless my @stat = stat;
    $blocks{$stat[4]} += $stat[12];
  }, ".";
At the end of this recursion, %blocks has the total blocks broken down by user, and a simple display loop shows the results:
  for (sort { $blocks{$b} <=> $blocks{$a} } keys %blocks) {
    printf "%16s %8d\n", scalar getpwuid($_), $blocks{$_};
  }
The recursion can be corralled by using the ``prune'' feature. If the variable $File::Find::prune is set to any true value during the callback routine when looking at a directory, that directory will not be examined further. (Of course, this works only when finddepth is not used, because by then it's too late.)
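Here's a minimal sketch of pruning in action; the throwaway directory and the names keep and skipme are invented for the demonstration:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical layout: keep/ and skipme/inside below a throwaway dir.
my $top = tempdir(CLEANUP => 1);
mkdir "$top/keep"   or die "mkdir: $!";
mkdir "$top/skipme" or die "mkdir: $!";
open my $fh, ">", "$top/skipme/inside" or die "open: $!";
close $fh;

our @seen;
find sub {
    # Setting $File::Find::prune stops any descent into this directory.
    return $File::Find::prune = 1 if $_ eq "skipme";
    push @seen, $_;
}, $top;

print "@seen\n";   # neither "skipme" nor "inside" appears
```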
For example, let's look at the count and size of all files in a CVS repository, organized by MIME type (text/plain, image/gif, and so on). We'll use File::MMagic to determine the MIME type, and we'll need to ignore the contents of any CVS directory:
  use File::MMagic;
  use File::Find;

  my $mm = File::MMagic->new;
  my %total;
  find sub {
    return $File::Find::prune = 1 if $_ eq "CVS";
    return if -d;
    my $type = $mm->checktype_filename($_);
    $total{$type}{count}++;
    $total{$type}{size} += (stat($_))[12];
    ## push @{$total{$type}{names}}, $File::Find::name;
  }, "/cvs/bigproject1";
At the beginning of the callback routine, if the name is exactly CVS, we'll set the prune variable to 1 and return from the subroutine. This prevents the processing not only of this entry, but also of any entry below it. Next, we'll figure out the MIME type, then compute a count and block-size summation. The commented line can be uncommented to track the exact filenames belonging to that MIME type.

Once the hash is created, we'll dump the data as follows:
  for (sort keys %total) {
    print "$_ has $total{$_}{count} items with $total{$_}{size} blocks\n";
    ## print map "  $_\n", sort @{$total{$_}{names}};
  }
If we added the filenames, we can uncomment the corresponding line in this loop to dump the filenames as well. This is useful if you get a MIME type of foo/bar, and you didn't think you had any foo-bar objects in your tree.
When I need to look at the contents of each file, and I don't need to decide pruning based on that, I find it's faster to push all the relevant filenames into @ARGV, and then use a <> loop to examine the contents. For example, suppose I want to dump out all the text files in the repository:
  use File::Find;

  @ARGV = ();
  find sub {
    return $File::Find::prune = 1 if $_ eq "CVS";
    return if -d;
    push @ARGV, $File::Find::name if -T;
  }, "/cvs/bigproject1";
Here, we'll start by clearing out the @ARGV array, then wandering down through the repository, ignoring any CVS directories and their contents, as well as any other directories. For the files that pass the -T test, we'll add the names to @ARGV. When this is done, it's as simple as:
  @ARGV = sort @ARGV;
  while (<>) {
    print "$ARGV\t$_";
  }
As each file is processed, the name of the file is placed into $ARGV, which I'm prefixing in front of each content line. What if I wanted line numbers? I just need to steal a bit more code from the perlfunc manpage, near the eof function description:
  @ARGV = sort @ARGV;
  while (<>) {
    print "$ARGV\t$.\t$_";
  } continue {
    close ARGV if eof;
  }
And now I get the filename, the line number, and the contents of the line for all the text files in my CVS tree.
Well, hopefully I've shown you some of the power of using File::Find. I've recently discovered in the CPAN a simple wrapper around this module called File::Find::Rule that makes it easier to specify some of the more common filters, but alas, I've run out of space in this article. Perhaps I'll cover that in a future article. Until then, enjoy!