Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Unix Review Column 36 (Feb 2001)
[suggested title: 'What is that, anyway?']
So, you've got a directory full of mixed stuff, or maybe an entire tree of directories. Just what's behind each of those names? Are they directories, symbolic links, or just plain files? And if they're files, are they text files or binary files? And if they're binary files, are they images, executables, or some random garbage?
Perl has many built-in operators that make it easy to get lists of names, and to figure out what you really have once you have a name.
For example, let's find all the subdirectories within the current directory:
    for my $name (glob '*') {
        next unless -d $name;
        print "one directory is $name\n";
    }
Here, the glob operator expands to all the non-dot-prefixed names within the current directory, and the -d operator returns true for those names that are directories.
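Because glob '*' skips dot-prefixed names, hidden directories never show up in that loop. If you want them too, one option is to read the directory yourself. Here's a minimal sketch; the helper name subdirs_of is made up for illustration and is not part of Perl:

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of all subdirectories of $dir,
# including the dot-prefixed ones that glob '*' would skip.
sub subdirs_of {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    # readdir also hands back "." and "..", so reject those explicitly
    my @names = grep { $_ ne '.' and $_ ne '..' and -d "$dir/$_" } readdir $dh;
    closedir $dh;
    return sort @names;
}

print "one directory is $_\n" for subdirs_of('.');
```

Note that readdir returns bare names, so the -d test needs the directory prepended.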
What if we wanted to do this recursively? We need to step outside of Perl's built-in operators, but not very far away. A module called File::Find, included with the core distribution, takes care of nearly all of our recursive directory processing problems. Let's find all directories below the current directory:
    use File::Find;

    find sub {
        return unless -d $_;
        print "one directory is $File::Find::name\n";
    }, ".";
The find subroutine takes a subroutine reference (called a coderef), here provided with the anonymous subroutine constructor. Each name found below . (specified at the end of this snippet) will trigger an invocation of this subroutine, with $File::Find::name set to the full name, and $_ set to the basename (with the working directory already changed to the directory in which the name is located).
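To see those variables side by side, here's a small sketch that records what File::Find hands to the subroutine for every name it visits. The helper name visit_report and the hash keys are invented for this illustration; the three variables are the ones File::Find documents:

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: walk $top and record what File::Find exposes
# for each name it visits.
sub visit_report {
    my ($top) = @_;
    my @seen;
    find sub {
        push @seen, {
            basename => $_,                  # name within the current directory
            dir      => $File::Find::dir,    # directory being processed
            name     => $File::Find::name,   # path including the starting point
        };
    }, $top;
    return @seen;
}

printf "%-20s in %-20s is %s\n", @{$_}{qw(basename dir name)}
    for visit_report('.');
```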
If you run the directory finder above, you'll see that the starting directory itself is reported too, under the name dot. (File::Find already skips the dot and dot-dot entries it reads inside each directory, so the starting point is the only such name the subroutine ever sees.) So how do we eliminate that? Well, just rejecting "dot" and "dot-dot" in the subroutine will do nicely:
    use File::Find;

    find sub {
        return if $_ eq "." or $_ eq "..";
        return unless -d $_;
        print "one directory is $File::Find::name\n";
    }, ".";
There. We'll keep moving forward from this as our base, because rejecting the meta-links of dot and dot-dot is generally a useful thing.
What about all the symbolic links? Can we find those? Sure! That's the -l operator:
    use File::Find;

    find sub {
        return if $_ eq "." or $_ eq "..";
        return unless -l $_;
        print "one symlink is $File::Find::name\n";
    }, ".";
Cool! But where do they point? That's the readlink operator, as in:
    use File::Find;

    find sub {
        return if $_ eq "." or $_ eq "..";
        return unless -l $_;
        my $dest = readlink($_);
        print "one symlink is $File::Find::name, pointing to $dest\n";
    }, ".";
We can skip the -l test by knowing that readlink automatically returns undef for anything that isn't a symlink, as in:
    use File::Find;

    my @search = @ARGV;
    @search = qw(.) unless @search;

    find sub {
        return if $_ eq "." or $_ eq "..";
        return unless defined (my $dest = readlink($_));
        print "one symlink is $File::Find::name, pointing to $dest\n";
    }, @search;
I've also made it simpler to run this on different directories by passing them on the command line.
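The tests so far can be rolled into one small classifier. This is only a sketch, and kind_of is a made-up name. The one thing to watch is the order: -d and -f follow symbolic links, so the symlink check has to come first or a link to a directory will be reported as a directory.

```perl
use strict;
use warnings;

# Hypothetical helper: say what kind of thing $path names.
# The symlink check comes first because -d and -f follow symbolic links.
sub kind_of {
    my ($path) = @_;
    my $dest = readlink $path;          # undef unless $path is a symlink
    return "symlink to $dest" if defined $dest;
    return "directory"        if -d $path;
    return "file"             if -f $path;
    return "other";                     # missing names, sockets, devices, ...
}

print "$_: ", kind_of($_), "\n" for glob '*';
```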
So, what do we have left? We can notice and skip over directories and symbolic links. How about files? Files are where the real action is located. And some of them are text-like, and some of them are binary-like. Although even those lines are blurry: you could argue that XML is really just a text-like binary format, and a Microsoft Word document is clearly text inside a binary-like format.
But back to what Perl can help with, first. Let's add the -T operator to distinguish those text files:
    use File::Find;

    my @search = @ARGV;
    @search = qw(.) unless @search;

    find sub {
        return if -d $_ or -l $_;
        return unless -T $_;
        print "One text file is $File::Find::name\n";
    }, @search;
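Keep in mind that -T is a guess: Perl peeks at the start of the file and decides based on how many odd-looking characters it sees, and an empty file passes both -T and its opposite, -B. Here's a minimal sketch that makes the three cases explicit; text_or_binary is a name invented for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: label a plain file using Perl's own heuristics.
# An empty file satisfies both -T and -B, so it gets its own label.
sub text_or_binary {
    my ($path) = @_;
    return "empty" if -z $path;
    return "text"  if -T $path;
    return "binary";
}

print "$_: ", text_or_binary($_), "\n" for grep { -f $_ } glob '*';
```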
And that's pretty cool. Just a list of text files. But this actually doesn't tell us too much. What we might really want is a list of all the Perl scripts. What can tell us that? Well, the Unix command called file can peer inside the contents of a file to figure out what it is. Let's invoke that on each file:
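    use File::Find;

    my @search = @ARGV;
    @search = qw(.) unless @search;

    find sub {
        return if -d $_ or -l $_;
        # caution: the name passes through the shell here, so spaces or
        # shell metacharacters in a filename will cause trouble
        my $file_said = `file $_`;
        if ($file_said =~ /perl/) {
            print "$File::Find::name: $file_said";
        }
    }, @search;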
Hey, look at that. Now we're pulling out just the names that file insists are possibly Perl programs. But this program will slow to a crawl on a large tree. We're reinvoking the file command individually on every file in the tree.
There are a couple of ways to go from here to speed it up. I could save all the filenames and invoke file just once at the end of the program:
    use File::Find;

    my @search = @ARGV;
    @search = qw(.) unless @search;

    my @list;
    find sub {
        return if -d $_ or -l $_;
        push @list, $File::Find::name;
    }, @search;

    for (`file @list`) {
        if (/perl/) {
            print;
        }
    }
And yes, that sped it up considerably, but now we don't get the results until the end of the tree walk, and we'll run into problems if the number of arguments exceeds a comfortable limit for file.
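One way around the argument limit is to hand the names over in batches. The sketch below is one possible approach, not the only one: run_in_batches is an invented name, and the batch size of 100 in the usage comment is an arbitrary choice. It uses the list form of a piped open, which skips the shell entirely, so names with spaces arrive intact.

```perl
use strict;
use warnings;

# Hypothetical helper: run the command in @$cmd over @names, at most
# $size names per invocation, and return all the output lines.
# The list form of open bypasses the shell, so odd filenames are safe.
sub run_in_batches {
    my ($size, $cmd, @names) = @_;
    my @output;
    while (my @batch = splice @names, 0, $size) {
        open my $pipe, '-|', @$cmd, @batch
            or die "Cannot run $cmd->[0]: $!";
        push @output, <$pipe>;
        close $pipe;
    }
    return @output;
}

# Assumed usage with the @list collected by the tree walk:
# print grep { /perl/ } run_in_batches(100, ['file'], @list);
```

The results still arrive only after the walk is finished, but each invocation of file now sees a bounded number of arguments.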
But there's another way. Out in the CPAN (at places such as search.cpan.org), we can find the File::MMagic module, which apparently is a Perl module derived from the file command created for the PPT project, which was in turn originally based on code written for Apache to implement the mod_mime_magic module, all to emulate the standard file command. Wow. And now I'm going to write a recursive, controllable, file-like program on top of that. Will the reuse ever stop? (I hope not!)
So, what we need from this module is the method called checktype_filename, which returns a MIME type (like text/plain or image/jpeg), and perhaps a semicolon and some additional information. So let's find all the Perl scripts quickly.
First, after a little playing around, I see that the string I'm looking for has "executable" followed by a space, then something ending in "perl" followed by a space and then "script". That's a simple regular expression, so I'll add that at the right place:
    use File::Find;
    use File::MMagic;

    my $mm = File::MMagic->new;

    my @search = @ARGV;
    @search = qw(.) unless @search;

    find sub {
        return if -d $_ or -l $_;
        my $type = $mm->checktype_filename($_);
        # return, not next: we're inside a subroutine, not a loop
        return unless $type =~ /executable \S+\/perl script/;
        print "$File::Find::name: $type\n";
    }, @search;
Now I know what programs to look at when I upgrade, to see which modules they all use. (Hmm. Sounds like an idea for another column. I'll note that.)
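If File::MMagic isn't installed, a much cruder check can stand in for it: look at the first line of the file for a #! line that mentions perl. To be clear, this is a swapped-in heuristic of my own for illustration, not what File::MMagic or the file command does, and looks_like_perl_script is a made-up name:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the first line of $path is a #! line
# that mentions perl. Scripts without such a line are missed.
sub looks_like_perl_script {
    my ($path) = @_;
    open my $fh, '<', $path or return 0;   # unreadable: treat as "no"
    my $first = <$fh>;
    close $fh;
    return defined $first && $first =~ /^#!.*perl/ ? 1 : 0;
}

print "$_\n" for grep { -f $_ and looks_like_perl_script($_) } glob '*';
```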
And one last fun one. Let's find all the images in the tree, and then call Image::Size (also found in the CPAN) on them to see their respective sizes. Just a few more tweaks:
    use File::Find;
    use File::MMagic;
    use Image::Size;

    my $mm = File::MMagic->new;

    my @search = @ARGV;
    @search = qw(.) unless @search;

    find sub {
        return if -d $_ or -l $_;
        my $type = $mm->checktype_filename($_);
        # return, not next: we're inside a subroutine, not a loop
        return unless $type =~ /^image\//;
        print "$File::Find::name: $type: ";
        my ($x, $y, $imgtype) = imgsize($_);
        if (defined $x) {
            print "$imgtype: $x x $y\n";
        } else {
            # on failure, the third value holds the error message
            print "error: $imgtype\n";
        }
    }, @search;
And as it turns out, I could have left the File::MMagic out of this program, since Image::Size can cheerfully inform me when it wasn't called on an image, but you know the old Perl motto: There's More Than One Way To Do It!
So, next time someone asks you "what do you have?", I hope you can answer them with a nice short Perl program now. Until next time, enjoy!