Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Unix Review Column 36 (Feb 2001)

[suggested title: 'What is that, anyway?']

So, you've got a directory full of mixed stuff, or maybe an entire tree of directories. Just what's behind each of those names? Are they directories, symbolic links, or just plain files? And if they're files, are they text files or binary files? And if they're binary files, are they images, executables, or some random garbage?

Perl has many built-in operators that make it easy to get lists of names, and to figure out what you really have once you have a name.

For example, let's find all the subdirectories within the current directory:

  for my $name (glob '*') {
    next unless -d $name;
    print "one directory is $name\n";
  }

Here, the glob operator expands to all the non-dot-prefixed names within the current directory, and the -d operator returns true for all those names which are directories.
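Because glob '*' skips the dot-prefixed names, a hidden directory never shows up in that loop. Here's a minimal sketch of one way to include them, assuming we read the directory ourselves with opendir and readdir. The subroutine name subdirs_including_hidden is made up for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return every subdirectory of $dir, including the
# dot-prefixed ones, but never the "." and ".." entries themselves.
sub subdirs_including_hidden {
  my ($dir) = @_;
  opendir(my $dh, $dir) or die "cannot opendir $dir: $!";
  # readdir returns bare names, so prefix $dir before testing with -d
  my @names = grep {
    $_ ne "." and $_ ne ".." and -d "$dir/$_"
  } readdir $dh;
  closedir $dh;
  @names = sort @names;
  return @names;
}

print "one directory is $_\n" for subdirs_including_hidden(".");
```

Unlike glob, readdir hands back dot and dot-dot as well, which is why the sketch has to reject them explicitly.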

What if we wanted to do this recursively? We need to step outside of the core perl, but not very far away. A core-included module called File::Find takes care of nearly all of our recursive directory processing problems. Let's find all directories below the current directory:

  use File::Find;
  find sub {
    return unless -d $_;
    print "one directory is $File::Find::name\n";
  }, ".";

The find subroutine takes a subroutine reference (called a coderef), here provided with the anonymous subroutine constructor. Each name found below . (specified on the last line of this snippet) will trigger an invocation of this subroutine, with $File::Find::name set to the full name, and $_ set to the basename (with the working directory already selected to the directory in which the name is located).

If you run this, you'll see that the starting directory itself is reported as well, as a bare ``dot'': find invokes the subroutine for the top-level directory too, with $_ set to . once it has changed into that directory. So how do we eliminate that? Well, just rejecting ``dot'' and ``dot-dot'' in the subroutine will do nicely:

  use File::Find;
  find sub {
    return if $_ eq "." or $_ eq "..";
    return unless -d $_;
    print "one directory is $File::Find::name\n";
  }, ".";

There. We'll keep moving forward from this as our base, because rejecting the meta-links of dot and dot-dot is generally a useful thing.
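To make the relationship between $_ and $File::Find::name a bit more concrete, here's a small sketch that records what the callback sees for each name. It also uses $File::Find::dir, which File::Find sets to the directory currently being visited. The subroutine name trace_find and the hash keys are made up for this illustration:

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: walk $top and record, for each name visited,
# the three values File::Find makes available to the callback.
sub trace_find {
  my ($top) = @_;
  my @seen;
  find sub {
    return if $_ eq "." or $_ eq "..";
    my %entry = (
      basename => $_,                 # name within the current directory
      dir      => $File::Find::dir,   # directory being visited
      name     => $File::Find::name,  # the two joined with a slash
    );
    push @seen, \%entry;
  }, $top;
  return @seen;
}

for my $entry (trace_find(".")) {
  print "$entry->{dir} + $entry->{basename} = $entry->{name}\n";
}
```

Since find has already changed into the directory being visited, file tests like -d work on plain $_, while $File::Find::name is the one to print or save for later.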

What about all the symbolic links? Can we find those? Sure! That's the -l operator:

  use File::Find;
  find sub {
    return if $_ eq "." or $_ eq "..";
    return unless -l $_;
    print "one symlink is $File::Find::name\n";
  }, ".";

Cool! But where do they point? That's the readlink operator, as in:

  use File::Find;
  find sub {
    return if $_ eq "." or $_ eq "..";
    return unless -l $_;
    my $dest = readlink($_);
    print "one symlink is $File::Find::name, pointing to $dest\n";
  }, ".";

We can skip the -l test, because readlink returns undef for anything that is not a symlink, as in:

  use File::Find;
  my @search = @ARGV;
  @search = qw(.) unless @search;
  find sub {
    return if $_ eq "." or $_ eq "..";
    return unless defined (my $dest = readlink($_));
    print "one symlink is $File::Find::name, pointing to $dest\n";
  }, @search;

I've also made it simpler to run this on different directories by passing them on the command line.
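Knowing where a symlink points naturally raises the question of whether anything is actually there. Here's a minimal sketch that reports only the dangling ones, relying on the fact that -l looks at the link itself while -e follows it to the target. The subroutine name dangling_symlinks is made up for this illustration:

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return the full names of all symlinks below the
# given directories whose targets do not exist.
sub dangling_symlinks {
  my @search = @_;
  my @dangling;
  find sub {
    return if $_ eq "." or $_ eq "..";
    return unless -l $_;   # is the name itself a symlink?
    return if -e $_;       # follows the link: true if the target exists
    push @dangling, $File::Find::name;
  }, @search;
  return @dangling;
}

my @search = @ARGV;
@search = qw(.) unless @search;
for my $name (dangling_symlinks(@search)) {
  my $dest = readlink($name);
  $dest = "(unreadable)" unless defined $dest;
  print "dangling symlink $name, pointing to $dest\n";
}
```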

So, what do we have left? We can notice and skip over directories and symbolic links. How about files? Files are where the real action is located. And some of them are text-like, and some of them are binary-like. Although even those lines are blurry: you could argue that XML is really just a text-like binary format, and a Microsoft Word document is clearly text inside a binary-like format.

But back to what Perl can help with, first. Let's add the -T operator to distinguish those text files:

  use File::Find;
  my @search = @ARGV;
  @search = qw(.) unless @search;
  find sub {
    return if -d $_ or -l $_;
    return unless -T $_;
    print "One text file is $File::Find::name\n";
  }, @search;

And that's pretty cool. Just a list of text files. But this actually doesn't tell us too much. What we might really want is a list of all the Perl scripts. What can tell us that? Well, the Unix command called file can peer inside the contents of a file to figure out what it is. Let's invoke that on each file:

  use File::Find;
  my @search = @ARGV;
  @search = qw(.) unless @search;
  find sub {
    return if -d $_ or -l $_;
    my $file_said = `file $_`;
    if ($file_said =~ /perl/) {
      print "$File::Find::name: $file_said";
    }
  }, @search;

Hey, look at that. Now we're pulling out just the names that file insists are possibly Perl programs. But this program will slow to a crawl on a large tree. We're reinvoking the file command individually on every file in the tree.

There are a couple of ways to go from here to speed it up. I could save all the filenames to invoke file once at the end of the program:

  use File::Find;
  my @search = @ARGV;
  @search = qw(.) unless @search;
  my @list;
  find sub {
    return if -d $_ or -l $_;
    push @list, $File::Find::name;
  }, @search;
  for (`file @list`) {
    if (/perl/) {
      print;
    }
  }

And yes, that sped it up considerably, but now we don't get the results until the end of the tree walk, and we'll run into problems if the number of arguments exceeds a comfortable limit for file.
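One way around the argument limit is to hand the names to file in batches. What follows is only a sketch under a few assumptions: the subroutine names in_batches and perl_lines_from_file are made up for this illustration, the batch size is whatever the caller considers comfortable, and a file command is expected on the PATH. It uses the list form of a piped open, which skips the shell, so names containing spaces or shell metacharacters pass through untouched:

```perl
use strict;
use warnings;

# Hypothetical helper: split a list of names into batches of at most
# $size names each, returned as array references.
sub in_batches {
  my ($size, @names) = @_;
  die "batch size must be at least 1" unless $size >= 1;
  my @batches;
  push @batches, [ splice(@names, 0, $size) ] while @names;
  return @batches;
}

# Hypothetical helper: run the file command once per batch, and return
# only the output lines that mention perl.
sub perl_lines_from_file {
  my ($size, @names) = @_;
  my @hits;
  for my $batch (in_batches($size, @names)) {
    # list-form piped open: no shell is involved
    open(my $fh, '-|', 'file', @$batch) or die "cannot run file: $!";
    push @hits, grep { /perl/ } <$fh>;
    close $fh;
  }
  return @hits;
}
```

With the @list collected by the tree walk, something like print for perl_lines_from_file(500, @list) would replace the final loop; the 500 is an arbitrary choice. A name beginning with a dash would still be taken as an option by file, although names found below . always begin with ./ and so avoid that.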

But there's another way. Out in the CPAN (at places such as search.cpan.org), we can find the File::MMagic module, which emulates the standard file command. Apparently it is derived from the file command created for the PPT project and from code written for Apache's mod_mime_magic module. Wow. And now I'm going to write a recursive controllable file-like program on top of that. Will the reuse ever stop? (I hope not!)

So, what we need from this module is the method called checktype_filename, which returns a MIME type (like text/plain or image/jpeg), and perhaps a semicolon and some additional information. So let's find all the Perl scripts quickly. First, after a little playing around, I see that the string I'm looking for has ``executable'' followed by a space, then something ending in ``perl'' followed by a space and then ``script''. That's a simple regular expression, so I'll add that at the right place:

  use File::Find;
  use File::MMagic;
  my $mm = File::MMagic->new;
  my @search = @ARGV;
  @search = qw(.) unless @search;
  find sub {
    return if -d $_ or -l $_;
    my $type = $mm->checktype_filename($_);
    return unless $type =~ /executable \S+\/perl script/;
    print "$File::Find::name: $type\n";
  }, @search;

Now I know what programs to look at when I upgrade, to see which modules they all use. (Hmm. Sounds like an idea for another column. I'll note that.)
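If installing a module isn't an option, a much cruder check is to peek at the #! line ourselves. This is a different technique from the magic-number lookup that File::MMagic performs, and the sketch below is only an approximation: the subroutine name has_perl_shebang is made up for this illustration, and the pattern will miss Perl programs without a #! line and can be fooled by an interpreter path that merely contains ``perl'' somewhere.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $path starts with a #! line that
# mentions perl, such as "#!/usr/bin/perl -w" or "#!/usr/bin/env perl".
sub has_perl_shebang {
  my ($path) = @_;
  open(my $fh, '<', $path) or return 0;   # unreadable: treat as "no"
  my $got = read($fh, my $head, 128);     # the first 128 bytes are enough
  close $fh;
  return 0 unless $got;                   # empty file or read error
  # [^\n]* keeps the match on the first line only
  return $head =~ /\A#![^\n]*\bperl/ ? 1 : 0;
}
```

Inside the find callback, return unless has_perl_shebang($_) would stand in for the checktype_filename call and the regular expression test.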

And one last fun one. Let's find all the images in the tree, and then call Image::Size (also found in the CPAN) on them to see their respective sizes. Just a few more tweaks:

  use File::Find;
  use File::MMagic;
  use Image::Size;
  my $mm = File::MMagic->new;
  my @search = @ARGV;
  @search = qw(.) unless @search;
  find sub {
    return if -d $_ or -l $_;
    my $type = $mm->checktype_filename($_);
    return unless $type =~ /^image\//;
    print "$File::Find::name: $type: ";
    my ($x, $y, $imgtype) = imgsize($_);
    if (defined $x) {
      print "$imgtype: $x x $y\n";
    } else {
      print "error: $imgtype\n";
    }
  }, @search;

And as it turns out, I could have left the File::MMagic out of this program, since Image::Size can cheerfully inform me when it wasn't called on an image, but you know the old Perl motto: There's More Than One Way To Do It!

So, next time someone asks you ``what do you have?'', I hope you can answer them with a nice short Perl program now. Until next time, enjoy!

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.
