Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 23 (Dec 1998)
Perl is good at handling text files. Among the most common files Perl is used to handle are the log files that spew out of nearly every tool that does interesting things on your system. Some of my past columns have focused on performing analysis of this data, but let's look at a more mundane problem: simple cleanup.
Let's say a tool is generating an ever-increasing log file, appending new messages to the end. Most of the file is interesting, but there are a number of lines beginning with warning: that are entirely uninteresting. So, our job is to remove those lines from the resulting file.
Let's start with the easiest approach. Assuming the log file has already been generated as log, let's filter it into clean-log, using a simple read-and-conditionally-print loop:
  open IN, "log" or die "Cannot open: $!";
  open OUT, ">clean-log" or die "Cannot create: $!";
  while (<IN>) {
    print OUT $_ unless /^warning:/i;
  }
Here, we have two nice normal opens, and then a loop. Each time through the while loop, a new line ends up in $_. This is tested with the regular expression, and if the match fails, the line gets printed to the output file.
Well, this works pretty nicely, but now we have used up nearly twice the disk space. Let's solve that by adding a renaming operation at the end. We can also avoid coming up with a new filename by using the convention that an appended tilde (~) character means ``a temporary file or a backup file''.
  my $name = "log";
  open IN, "<$name" or die "Cannot open: $!";
  open OUT, ">$name~" or die "Cannot create: $!";
  while (<IN>) {
    print OUT $_ unless /^warning:/i;
  }
  close IN;
  close OUT;
  rename "$name~", $name or die "Cannot rename: $!";
Here, I've parameterized the name in $name, and the output file is now named that name with an appended tilde. Notice the last few steps: we're now renaming the temp file over the original file, thus deleting the original file. This is getting better. I now have a script I can run that appears merely to make the file shorter, and contain only what I want!
Hmm. What if this script makes a mistake? It'd be nice to have a backup of the original data that I could look at for a while just in case. I could run a diff on the old and new files to see what changed, for example. Let's do the steps in the other order: rename the file first, then generate the selected lines into a new file with the original name.
  my $name = "log";
  rename $name, "$name~" or die "Cannot rename: $!";
  open IN, "<$name~" or die "Cannot open: $!";
  open OUT, ">$name" or die "Cannot create: $!";
  while (<IN>) {
    print OUT $_ unless /^warning:/i;
  }
Hmm. That looks nicer. Now I have a backup file (named with tilde) and the new data file. Let's make this even easier to use; there's no reason to hardwire the filename into the script. Let's get that from the command line (@ARGV), and let there be many files on the command line:
  foreach my $name (@ARGV) {
    rename $name, "$name~" or die "Cannot rename: $!";
    open IN, "<$name~" or die "Cannot open: $!";
    open OUT, ">$name" or die "Cannot create: $!";
    while (<IN>) {
      print OUT $_ unless /^warning:/i;
    }
  }
Wow, getting even more powerful, but with more lines of code. But perhaps I'm actually leading you astray a bit. Larry Wall (the creator of Perl) probably had to do this kind of text editing enough times that he taught Perl to do it directly. The in-place editing mode handles this stuff for us. If $^I is set (either from within the program, or by the -i command-line switch), opening files with the diamond operator (<>) automatically performs a similar operation:
  $^I = "~";
  while (<>) {
    print unless /^warning:/i;
  }
Ahh. Much easier. And roughly the same operations as above. Notice I didn't need the loop for @ARGV; that's implicit in the diamond operator. The value set for $^I (normally undef) is added to the names of the files in @ARGV to create backup files.
All this renaming and editing presumes that the program is no longer writing to the logfile, and that we can do as we please with the original data. Let's throw a monkey wrench into that picture, to illustrate handling a less-common but equally important environment: file locking.
The UNIX filesystem allows multiple processes to gain write access to a file. But unless there's some way of coordinating the writes to a file, the data will become all intermingled.
The most common way to prevent this intermingling is with a file lock. In Perl, this is most easily accessed with the flock operator, named for the system call introduced by the UNIX BSD developers. Even though the same-named system call doesn't exist on System V variants of UNIX, the Perl operator maps into the appropriate underlying operations to perform a compatible operation, so it's fairly portable.
The basic rules are as follows:

- Programs that want to read a file should open the file, then immediately use flock HANDLE, 1.

- Programs that want to both read and write a file should open the file, then use flock HANDLE, 2.

- The call to flock will block until the file is available, at which time the requested operations can be performed with some degree of safety.

- When the operation has been completed, release the lock by closing the filehandle. (You can also unlock the filehandle without closing it, but you must know precisely what you are doing. It's easier just to always close the handle.)
Note that locking a file only cooperates with other processes that are also locking that file. If a process so chooses, it can come along, open the file for reading or writing, and have its way with the file. That's why it's called advisory locking. (Some UNIX variants have implemented mandatory locking, but that's not common yet.)
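To make the writer's side concrete, here's a sketch of what a cooperative logging tool might do around each append. The filename and message are invented for illustration; the flock calls are the point:

```perl
# Sketch of a well-behaved log writer: it takes an exclusive lock
# (mode 2) around each append, so readers and cleanup scripts that
# also flock the file never see a half-written line.
open LOG, ">>example.log" or die "Cannot append: $!";
flock LOG, 2;      # blocks until the exclusive lock is granted
seek LOG, 0, 2;    # re-seek to end: others may have appended while we waited
print LOG "warning: disk is getting full\n";
close LOG;         # closing the handle releases the lock
```

The seek after flock matters: another locker may have extended the file between our open and our acquiring the lock.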
So, let's pretend our tool that's creating the log file is still writing to it, and that it's a nice tool that flocks the file whenever it is really writing. How can we remove those warnings now, without copying the data somewhere else?
The first and fastest way is to pull the data entirely into memory, rewriting it without those pesky warnings:
  my $name = "log";
  open LOG, "+<$name" or die "Cannot open $name: $!";
  flock LOG, 2;
  my @data = grep !/^warning:/i, <LOG>;
  seek LOG, 0, 0;
  truncate LOG, 0;
  print LOG @data;
  close LOG;
Here, I've added a plus to the open mode to indicate that the same handle will be used with both reading and writing. (I can't just open the filehandle for writing later, because it would lose the lock that I'm holding on the filehandle.)
After the lock is obtained, we're free to mangle the file in useful ways. We load up @data with the lines from the file that are interesting, then rewrite the file by seeking to its beginning, truncating the file to zero length, and then dumping the data down to the file. The final close frees up the file so that the other tool can now obtain a new lock to write into the file again.
Well, that was pretty nifty, but notice that I now needed to have the entire (new) data in memory. For a really huge file, this is bound to be a problem. So, let's edit the file ``in place''. It's a bit tricky, so I'll give you the program then explain how it works:
  my $name = "log";
  open LOG, "+<$name" or die "Cannot open $name: $!";
  flock LOG, 2;
  my $write_pos = 0;
  while (<LOG>) {
    unless (/^warning:/i) {
      my $read_pos = tell LOG;
      seek LOG, $write_pos, 0;
      print LOG $_;
      $write_pos = tell LOG;
      seek LOG, $read_pos, 0;
    }
  }
  truncate LOG, $write_pos;
  close LOG;
Here, I'm maintaining two pointers into the file: a read position, and a write position. The read position is being maintained automatically, via the while loop. The write position initially starts out as zero, and is remembered in the $write_pos variable.

Inside the loop, when I see an entry I want to keep, I compute the current read position via tell, go to the writing position, write the value I want to remember, and then return to the reading position. Once I've gone through the entire file, I can simply truncate it to the write position, and I'm done.
This works only because I'm making the file shorter, but it will work on files of huge length, since the most I've actually got in memory at any moment is one line.
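If you'd like to convince yourself that the two-pointer dance really works, here's a self-contained rerun of it against a scratch file (the demo-log name and sample lines are made up for the demonstration):

```perl
# Write some sample data, mixing keepers and warnings.
my $name = "demo-log";
open LOG, ">$name" or die "Cannot create: $!";
print LOG "one\nwarning: drop me\ntwo\nWARNING: me too\nthree\n";
close LOG;

# Now the two-pointer in-place rewrite, exactly as described above.
open LOG, "+<$name" or die "Cannot open $name: $!";
flock LOG, 2;
my $write_pos = 0;
while (<LOG>) {
  unless (/^warning:/i) {
    my $read_pos = tell LOG;    # remember where the next read starts
    seek LOG, $write_pos, 0;    # hop back to the write position
    print LOG $_;               # write the line we're keeping
    $write_pos = tell LOG;      # remember the new write position
    seek LOG, $read_pos, 0;     # hop forward to resume reading
  }
}
truncate LOG, $write_pos;       # chop off the leftover tail
close LOG;
```

Afterward, demo-log contains just the three keeper lines, and nothing was ever held in memory but the current line.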
So, there you have it. Many ways to reduce the amount of data you'll be wading through later. Enjoy.