Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Linux Magazine Column 30 (Nov 2001)

[suggested title: Finding Files Incrementally]

The traditional File::Find module included with Perl is nice. I use it frequently. However, it's got this interesting feature, or shall we call it, ``limitation'': it wants to be in control until it has found everything of interest.

Now that's perfectly fine for most applications, but occasionally, I've wanted to turn a hierarchical search ``inside out''. That is, I'd set up the search, including the start directories and the ``wanted'' criteria, and then call some function repeatedly to get the ``next'' item of interest. This is similar to how I can call the find program externally:

  open FIND, "find /home -atime +60 -print|";
  while (<FIND>) {
    chomp;
    printf "%s aged %d days using %d blocks\n",
      $_, -A $_, (-s $_) / 1024;
  }

Recently, I saw Mark-Jason Dominus, Perl hacker extraordinaire, give a talk on using streams and iterators, and he pulled out an example that was very close to what I wanted. He was showing off a subroutine that acted as a factory, generating subroutines that when repeatedly invoked, provided the ``next'' file to be found in sequence. (It's one of his examples from his forthcoming book, some of which may already be found at http://perl.plover.com/book/.) He then went on to show how to put this inside a filter to throw away the files that did not meet the necessary criteria.

I started puzzling through that, and decided to take a different but similar approach. Invoking a subroutine repeatedly is functional, but not a natural Perl construct. Iteration in Perl is more likely accomplished by reading from a filehandle, so I first pondered the idea of using a tied filehandle to hold the iterator. (If you're not familiar with the concept of a tied object, see the perltie manpage, included with the standard Perl documentation.) But then I started getting hung up on what ``seek'' should mean (if supported at all), how to return structured data from a filehandle read, and how to reset the search back to the beginning.

But after a while, it dawned on me: the tie mechanism was right, but I should use a hash instead of a filehandle! A hash has a nice iteration interface in the form of calls to each. And a tied hash automatically sequences through the entire iteration when keys is called, so getting the full list was simple as well.

After puttering with it a bit, I came up with the code for this month's column. The interface provides a virtual hash. The keys of the hash are the filenames meeting the criteria, appearing in the order in which they are found. The corresponding values are the results from the wanted subroutine, defaulting to an arrayref of the stat values. The criteria and starting directories are established when the hash is tied. For example, the find code above would be rewritten as:

  use TieFinder;
  tie my %f, TieFinder::, sub { -A > 60 }, "/home";
  while (defined ($_ = each %f)) {
    printf "%s aged %d days using %d blocks\n",
      $_, -A $_, (-s $_) / 1024;
  }

So let's look at [listing one, below] and see how I did it.

Lines 1 through 3 start nearly every program I write, turning on warnings, enabling compiler restrictions, and unbuffering standard output.

Line 5 pulls in the Data::Dumper routines, so I could show what the ``hash'' looked like during testing.

Lines 7 through 56 form a virtual use block. I could extract all of this part of the file, stick it into TieFinder.pm somewhere along my @INC path, and replace it with use TieFinder. But for testing purposes, it was easier to include it in the same file.
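
If I did that extraction, the module file might look roughly like this sketch (the final 1; is the true value that every module must return when loaded):

  ## TieFinder.pm, placed somewhere along @INC
  package TieFinder;

  ## ... TIEHASH, NEXTKEY, FIRSTKEY, and FETCH from the listing go here ...

  1;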

To tie a hash, we have to provide a certain set of standard methods in the class we're creating. The most important of these is TIEHASH, defined in lines 10 through 18. This constructor must return a blessed object in this (or a related) class. The arguments in @_ are taken from the tie parameter list.
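
In other words (as the perltie manpage describes), the tie is roughly equivalent to a direct constructor call, and tied hands back the underlying object. A quick sketch with a trivial always-true wanted:

  tie my %f, TieFinder::, sub { 1 }, "/home";
  ## roughly the same as: my $object = TieFinder::->TIEHASH(sub { 1 }, "/home");
  my $object = tied %f;   # fetches that same blessed TieFinder object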

After shifting off the class name in line 11, we next see if the first parameter is a coderef. If so, this is the wanted subroutine, which we save in $wanted. If not, we use a default wanted which merely returns the first parameter passed to it. (We'll see later that this is an arrayref of the stat call on the file, but I'm getting ahead of myself.) The default wanted thus selects every entry, since that arrayref is always a true value.

Lines 13 to 17 construct the object as a blessed hashref, saving the wanted coderef and the original top-level dirs (useful when we reset the iterator), and ``priming the pump'' by setting those dirs as the current to-be-dispensed names as well. We put those in reverse order because the normal behavior is to pop the queue array, and we want them to come out in the original expected order.
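
A tiny example shows the effect: since pop pulls from the end of the array, the reversed priming makes the directories come off in the order originally given:

  my @queue = reverse qw(/home /tmp /var);  # queue is now /var /tmp /home
  print pop @queue;                         # prints "/home", the first dir given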

Lines 20 to 41 define the NEXTKEY method, used when the each operator acts upon the tied hash, or repeatedly when some higher-level construct (like keys) needs to act on the entire hash. This is the guts of the iterator, identifying files to return, directories to descend, and generally managing the flow.

The loop in lines 22 to 39 finds the first element of the queue that is suitable as a returned key. Initially the queue holds the top-level directory (or directories) given during the tie operation. The first item is popped from the queue in line 23, and is stat'ed in line 24 (as an lstat so we can differentiate a symbolic link from what it points at).

Lines 25 to 33 generate the descendants of a directory if needed. First, in line 26 we skip the directory descent for any symbolic links, because recursing through a link that points back at a superior directory would loop forever. (I'm using the stat-less _ shortcut for the symlink test operator here, since I know that the most recent stat was of the file in question.) Lines 27 and 28 attempt to open a directory handle on the path, which will simply fail if the path is not a readable directory, keeping us from having to pre-test that possibility.

Lines 29 through 31 read the directory, throw away the dot and dot-dot entries, sort them alphabetically, prepend the directory path, and then reverse the order of the names. These names are pushed onto the end of the queue, meaning that they'll be the next names to be processed. This creates a ``depth-first'' traversal. To get a ``breadth-first'' traversal, we'd put names on one end of the queue and pull them off the other end, as sketched below. A more flexible routine would allow this to be selected via a configuration parameter. We could also do some jiggering in here to get pre-traversal versus post-traversal handling of the directory names themselves.
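
Here's an untested sketch of that breadth-first variant of NEXTKEY: items are shifted off the front while children are still pushed onto the back, which also means TIEHASH and FIRSTKEY would prime the queue with the dirs in their original order, with no reverse calls anywhere:

  sub NEXTKEY {
    my $self = shift;
    while ( @{$self->{queue}} ) {
      my $item = shift @{$self->{queue}};   # FIFO: pull from the front...
      my @lstat = lstat($item);
      {
        next if -l _;                       # still don't recurse on symlinks
        local *DH;
        opendir DH, $item or next;
        push @{$self->{queue}},             # ...but push children on the back
          map "$item/$_",
            sort grep { $_ ne "." and $_ ne ".." } readdir DH;
        closedir DH;
      }
      next unless do {
        local $_ = $item;
        $self->{hash}{$item} = $self->{wanted}->(\@lstat);
      };
      return $item;
    }
    return undef;
  }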

Lines 34 to 37 set up the calling of the wanted routine. First, the item is copied into a localized $_. Next, the wanted routine is called, passing the stat structure as a single arrayref parameter. The return value is kept in a private hash belonging to the object: if the user of the tied hash later asks for the value corresponding to the given key, we'll need to return it, so we stash it in a cache hash. If the value is not true, the next in line 34 tries the next item in the queue. Otherwise, we return the item, which becomes the next key of our virtual hash. If the queue drains, we've used up all the keys, and we return undef in line 40 to show that.
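
As a hypothetical illustration of that calling convention, a wanted can use both the localized $_ (holding the full path) and the arrayref argument at once; here, keeping only .pm files, with the file size (element 7 of the stat list) as the value:

  tie my %modules, TieFinder::,
    sub { /\.pm$/ and (shift->[7] or "0e0") },  # size in bytes, 0e0 if empty
    "/usr/lib/perl5";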

Lines 43 to 48 define the FIRSTKEY method, used when the tied hash interface wants to reset the key generation to the first item. In this case, we reset the queue to the initial top directories, flush the cache hash, and call NEXTKEY to actually produce that first key.

Lines 50 to 55 define the FETCH method, used when the tied hash interface wants to get a value for a given key. I've inserted a debug statement in here for tracing, which was quite informative. The value comes from the cache hash, set to the return value from wanted for that particular path name. Note that a user can make an explicit call to fetch a value that has not yet been populated, in which case we return undef. Perhaps a more robust implementation would die if the key did not exist in the cache hash.
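
Such a stricter FETCH might look like this sketch, using croak from the standard Carp module so the error is reported from the caller's point of view:

  use Carp;

  sub FETCH {
    my $self = shift;
    my $key = shift;
    croak "no value known for '$key'"
      unless exists $self->{hash}{$key};
    $self->{hash}{$key};
  }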

And that's it. We now have the means to implement a hash which appears to have the results of a recursive file find operation as keys, and the corresponding values being the return values of a user-defined function, defaulting to the stat-values for that file. The user-defined function can also return a false value to indicate that a particular file is not desired.

Let's play with it a bit, starting in line 58. The @ARGV array here provides the top-level directory starting points. The wanted subroutine returns either the size in bytes, or the true-but-zero-valued string 0e0 for empty files. This selects everything, but gives us an interesting integer to look at. Other subroutines I used while testing include:

  sub { -d _ ? 2 : -T _ ? 1 : 0 }

which returns 2 for directories, 1 for text files, and skips over any binary files, as well as:

  sub { /\d/ }

which kept only the files that had digits somewhere in the pathname. The $_ variable has the full path here, unlike File::Find, in which $_ has only the basename. Another choice might be:

  sub { shift->[2] & 2 }

for world-writable files, looking at the mode word of the stat structure and AND-ing it with 2, the world-write bit. And there's even:

  sub { sprintf "0%03o", shift->[2] & 07777 }

which selects all files, but provides a text string of the permission values, or even

  sub { localtime shift->[9] }

which gives back the modification time as a string in local time format. The possibilities are endless.

Lines 60 to 63 show an example of a complete scan, using the each operator. The normal scan causes the queue to be built up and drained as needed, allowing us to ``process'' each entry as quickly as it is known by the routine. The keys are obtained by calling FIRSTKEY and NEXTKEY, and the values are fetched with FETCH. Uncomment the debug line in FETCH to see this happening.

Lines 65 to 68 show an example of getting just the keys, by using the each operator in a scalar context to kick the iterator without fetching the value. Again, uncommenting the debug line in FETCH shows that no calls are made for the values, making this a painless run through the data, and yet we've got control as soon as some of the data has been returned.

The last three demonstrations use Dumper to show that this acts as an ordinary hash with respect to the keys and values operators, and even when passed as a hashref directly to Dumper. Very nice. Oddly, I noticed that the very last example seems to call FETCH twice for each element, at least under Perl version 5.005_03. Perhaps that's been altered in a later version.

And there you have it. A powerful model for painless filewalking, as well as a nice demonstration of using Perl's built-in interfaces to capture the notion of an iterator. Until next time, enjoy!

Listing

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use Data::Dumper;
        =6=     
        =7=     BEGIN {
        =8=       package TieFinder;
        =9=     
        =10=      sub TIEHASH {
        =11=        my $class = shift;
        =12=        my $wanted = (@_ && ref $_[0] eq 'CODE') ? shift : sub { shift };
        =13=        bless {
        =14=               wanted => $wanted,
        =15=               dirs => [@_],
        =16=               queue => [reverse @_], # prime the pump
        =17=              }, $class;
        =18=      }
        =19=    
        =20=      sub NEXTKEY {
        =21=        my $self = shift;
        =22=        while ( @{$self->{queue}} ) {
        =23=          my $item = pop @{$self->{queue}};
        =24=          my @lstat = lstat($item);
        =25=          {
        =26=            next if -l _;           # don't recurse on symlinks
        =27=            local *DH;
        =28=            opendir DH, $item or next;
        =29=            push @{$self->{queue}},
        =30=              reverse map "$item/$_",
        =31=                sort grep { $_ ne "." and $_ ne ".." } readdir DH;
        =32=            closedir DH;
        =33=          }
        =34=          next unless do {
        =35=            local $_ = $item;
        =36=            $self->{hash}{$item} = $self->{wanted}->(\@lstat);
        =37=          };
        =38=          return $item;
        =39=        }
        =40=        return undef;
        =41=      }
        =42=    
        =43=      sub FIRSTKEY {
        =44=        my $self = shift;
        =45=        $self->{queue} = [reverse @{$self->{dirs}}];
        =46=        $self->{hash} = {};
        =47=        $self->NEXTKEY;
        =48=      }
        =49=    
        =50=      sub FETCH {
        =51=        my $self = shift;
        =52=        my $key = shift;
        =53=        ## warn "DEBUG: fetching $key\n";
        =54=        $self->{hash}{$key};
        =55=      }
        =56=    }
        =57=    
        =58=    tie my %f, TieFinder::, sub { -s _ or "0e0" }, @ARGV;
        =59=    
        =60=    print "key value\n";
        =61=    while (my($k, $v) = each %f) {
        =62=      print "k = $k, v = $v\n";
        =63=    }
        =64=    
        =65=    print "key\n";
        =66=    while (defined(my $k = each %f)) {
        =67=      print "k = $k\n";
        =68=    }
        =69=    
        =70=    print "dumper keys\n";
        =71=    print Dumper([keys %f]);
        =72=    
        =73=    print "dumper values\n";
        =74=    print Dumper([values %f]);
        =75=    
        =76=    print "dumper hash\n";
        =77=    print Dumper(\%f);

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.