Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Unix Review Column 9 (July 1996)

Perl's text processing capabilities provide direct access to a number of high-level facilities for manipulating text. However, sometimes the problem is not what to do with the text once you've got it, but eliminating stuff that doesn't look like text. Let's take a look at how to use Perl's facility for recognizing text files (as opposed to binary files) to ease a typical processing task.

First, let's reconstruct the simplest form of program that emulates the standard Unix grep command.

        #!/usr/bin/perl
        $search = shift;
        $showname = @ARGV > 1;
        while (<>) {
                next unless /$search/o;
                print "$ARGV: " if $showname;
                print;
        }

The first executable line after the #! header takes the first command-line parameter (found in @ARGV) and shifts it off into the $search variable. Here, I take advantage of the fact that the shift operator defaults to @ARGV.

Next, if more than one argument remains, I record that fact in the $showname variable, so that I can print the filename along with each matching line, just like the grep command does. $showname will be true when the number of remaining arguments is greater than one, and false otherwise.
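
Those two idioms can be tried on their own. The following is a minimal sketch, assuming a made-up argument list assigned to a localized @ARGV so that it runs without any command-line arguments; the filenames are hypothetical and are never opened.

```perl
use strict;
use warnings;

# Hypothetical command line: a pattern followed by two filenames.
local @ARGV = ('fred', 'a.txt', 'b.txt');

my $search   = shift;       # outside a subroutine, shift defaults to @ARGV
my $showname = @ARGV > 1;   # an array in numeric context yields its element count

print "search=$search\n";
print "remaining=", scalar(@ARGV), "\n";
print "showname=", ($showname ? "yes" : "no"), "\n";
```

With only one filename after the pattern, $showname would come out false and no filename prefix would be printed.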

Then comes an ordinary ``diamond-loop'': each line from each file specified on the command line is read into the $_ variable, and the body of the loop gets to take a whack at it.

The first line in the loop looks for $search (interpreted as a Perl regular expression) within the contents of $_. If it isn't found, we go on to the ``next'' line. The ``o'' modifier on the regular expression match is a speed optimization -- without it, the regular expression would have to be ``compiled'' on each turn through the loop, slowing us down needlessly because the regular expression is not changing each time.
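
As an aside, later versions of Perl offer another way to compile a pattern exactly once: the qr// operator. This is not what the program above does; it is only a sketch of the same ``compile once, match many times'' idea, with a hypothetical pattern and a small list standing in for the lines read by the diamond operator.

```perl
use strict;
use warnings;

my $search = 'fr[eo]d';      # hypothetical pattern, as if shifted off @ARGV
my $re     = qr/$search/;    # compiled once, before the loop starts

# Stand-in for the lines that <> would deliver.
my @lines   = ("fred\n", "barney\n", "frodo\n");
my @matched = grep { $_ =~ $re } @lines;

print @matched;
```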

The $showname variable is then consulted: if it is true, we need to prefix the output line with the filename in which the line was found. Luckily, Perl provides the filename for us in the $ARGV scalar variable (only coincidentally named like the ARGV filehandle and the @ARGV array).

Finally, whether or not the filename was printed, the current line is printed (from $_, because the print operator defaults to printing $_).

So, this little program emulates the simplest form of the Unix grep command: namely, the invocation like:

        grep regex [file ... ]

I've also been told that this program actually executes faster than the corresponding vendor-supplied grep invocation for certain regular expressions: the regular expression engine inside Perl is supposed to be one of the fastest around.

So, why have I rewritten grep? So that I can give it additional functionality! The system-supplied grep command groks through both text files and binary files. Many times, it'd be nice for me to simply say something like:

        textgrep regex *

and have this magical ``textgrep'' look only in the text files, ignoring the binary files that match *. In particular, my personal ``binary'' directory ($HOME/.bin) has a number of binary programs, but many more executable scripts (mostly in Perl :-), and I'd like to be able to:

        cd $HOME/.bin
        textgrep "somestring" *

to grep just the scripts, for example. Another use might be inside a ``build'' directory with a lot of program source and objects. If you're looking for a particular literal string in a source file, you don't also want to hit the binary from which the source was compiled!

So, let's see if we can make our ``grep-in-perl'' become a ``textgrep''. The first step is distinguishing text files from binary files. Perl makes this easy, with its built-in -T operator. This operator returns true if the argument string (a filename) or filehandle represents a ``text file''. Now, Unix doesn't have a simple bit to test if something is ``text'' or ``binary'', so instead, the Perl process grabs a chunk of bytes from the file, and guesses whether it's more likely to be a text file or binary file. It usually guesses right, but occasionally can get fooled.
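
To watch -T make its guess, here is a small sketch that writes one text-like file and one binary-ish file and asks -T about each. The make_file helper and the sample contents are my own assumptions for illustration; File::Temp is used only to get scratch files that clean themselves up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $content to a fresh temporary file, return its name.
sub make_file {
    my ($content) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} $content;
    close $fh or die "close failed: $!";
    return $name;
}

my $text_file   = make_file("just some ordinary words\n" x 20);
my $binary_file = make_file(pack("C*", 0 .. 255));   # includes a NUL byte

print "text sample:   ", ((-T $text_file)   ? "text" : "binary"), "\n";
print "binary sample: ", ((-T $binary_file) ? "text" : "binary"), "\n";
```

The second file contains a NUL byte in the chunk that Perl inspects, which is enough for -T to reject it.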

Because @ARGV contains a series of filenames, it seems natural then to test each one with -T to see if it is text-like or binary-ish. We can even do this in a compact way with the grep operator (not to be confused with the Unix grep command, although the name was chosen for its similarity).

The grep operator evaluates a block for each element of a list, setting each element into $_ temporarily. For those elements where the block returns true, the elements are retained in the return value, so at first glance, we can just write:

        @ARGV = grep { -T } @ARGV;

However, this fails in the face of the possibility of having ``-'' somewhere in the @ARGV list. The ``-'' is supposed to mean ``read from standard input here'' which hopefully is a textfile. However, -T will reject this as a non-existing file, so we have to special-case it to ensure that it survives the preening. Not too difficult, but it now looks like:

        @ARGV = grep { -T or $_ eq "-" } @ARGV;

Nice! @ARGV now contains only textfiles! Only one other weird case to deal with now. If the original list contains only binaries, the new @ARGV is now empty. We can't have that, because that would mean that the program reads from standard input, even though the user originally supplied a series of names on the command line. Take a look at how I handled it in the finished program:

        #!/usr/bin/perl
        $search = shift;
        $showname = @ARGV > 1;
        @ARGV = "-" unless @ARGV;
        @ARGV = grep { -T or $_ eq "-" } @ARGV;
        exit 0 unless @ARGV;
        while (<>) {
                next unless /$search/o;
                print "$ARGV: " if $showname;
                print;
        }

Notice that the fourth line replaces an empty @ARGV with a one-element list containing just ``-''. This doesn't change the program's meaning at all, because an empty @ARGV already means ``read standard input'', but it does allow me to test later whether we've reduced an original list to nothingness.

The following line is the ``text file only'' reduction as described above. The line after that terminates the program if the list is now empty, because there's nothing to scan! The remainder of the program is identical to the previous version.
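
The argument-filtering steps can also be exercised in isolation. The sketch below wraps them in a hypothetical filter_args subroutine (my name, not part of the program) and runs it against scratch files created with File::Temp, covering the three cases discussed: a mixed list, no names at all, and a list of nothing but binaries.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: apply the same two steps as the program to a list of names.
sub filter_args {
    my @names = @_;
    @names = ('-') unless @names;                # no names means standard input
    return grep { -T $_ or $_ eq '-' } @names;   # keep text files and "-"
}

# Hypothetical helper: write $content to a fresh temporary file, return its name.
sub make_file {
    my ($content) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} $content;
    close $fh or die "close failed: $!";
    return $name;
}

my $text   = make_file("plain old text\n" x 10);
my $binary = make_file(pack("C*", 0 .. 255));

my @mixed  = filter_args($text, $binary, '-');   # text file and "-" survive
my @none   = filter_args();                      # becomes just "-"
my @binary = filter_args($binary);               # nothing survives

print scalar(@mixed), " ", scalar(@none), " ", scalar(@binary), "\n";
```

An empty result in the last case is the signal the program uses to exit quietly rather than fall back to reading standard input.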

We now have a program that could be called ``textgrep''. It takes no options yet, although it understands ``-'' in the list of files to mean ``standard input'', and presumes that standard input is always a text file.

Let's take it further. The standard Unix grep command has a ``-l'' (lowercase L, not capital I) option which says to simply list the matching filenames one at a time, rather than the lines that match. This is useful to perform further operations. For example, to edit all the files in the current directory that match ``fred'', I could say:

        vi `grep -l fred *`

Or to move them all to the ../freds directory,

        mv `grep -l fred *` ../freds

So, let's give ``textgrep'' this same functionality. Here's the code:

        #!/usr/bin/perl
        $names++, shift if $ARGV[0] eq "-l";
        $search = shift;
        $showname = @ARGV > 1;
        @ARGV = "-" unless @ARGV;
        @ARGV = grep { -T or $_ eq "-" } @ARGV;
        exit 0 unless @ARGV;
        while (<>) {
                next unless /$search/o;
                if ($names) {
                        print "$ARGV\n";
                        close ARGV;
                } else {
                        print "$ARGV: " if $showname;
                        print;
                }
        }

Only half a dozen additional lines. Let's see: the first line after the #! line examines $ARGV[0] (the first argument). If this is equal to ``-l'', I want to invoke ``names-only'' mode, so I set $names, and shift away the ``-l''. Hopefully, anything else is a regular expression, captured in the following line.

The other change is within the body of the diamond loop. Note that when a line is found, $names is checked again. If true, the program prints the name of the file followed by a newline, and then closes ARGV. The point of closing ARGV is that the diamond operator will automatically advance to the next file when we get to the top of the loop. Once a matching line is found, there's no point in checking the rest of the file. This also ensures that a given filename will be shown only once, no matter how many potential matches there are in the file.
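
The effect of closing ARGV is easy to confirm with a small experiment. This sketch assumes two scratch files (created with File::Temp through a hypothetical make_file helper), each holding several matching lines, and records the filename on every match instead of printing it.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $content to a fresh temporary file, return its name.
sub make_file {
    my ($content) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    print {$fh} $content;
    close $fh or die "close failed: $!";
    return $name;
}

my $first  = make_file("fred one\nfred two\nfred three\n");
my $second = make_file("barney\nfred again\nfred once more\n");

local @ARGV = ($first, $second);
my @hits;
while (<>) {
    next unless /fred/;
    push @hits, $ARGV;   # record the name of the file being read
    close ARGV;          # abandon the rest of this file; <> moves to the next
}

print scalar(@hits), " files matched\n";
```

Even though five lines contain ``fred'', each filename is recorded only once, because the rest of the file is skipped after the first hit.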

If $names is false, the ``-l'' option wasn't seen, so the behavior from the previous textgrep program is selected in the ``else'' block.

With this program, I can then examine just the textfiles. For example, edit only the textfiles that contain ``fred'':

        vi `textgrep -l fred *`

Or even, send all text files (but just textfiles) to the lineprinter:

        pr `textgrep -l '^' *` | lpr -Pslatewriter

Note here that ``^'' matches the beginning of a line, which succeeds on the first line of every non-empty file that grep sees, but remember that textgrep rejects non-text files!

As you can see, with about a dozen lines of Perl code, I've recreated a very common Unix utility, and even given it additional functionality. I hope you've enjoyed this little bit of text (manipulation). See ya next time!

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.
