Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 23 (Dec 1998)
Perl is good at handling text files. Among the most common files Perl is used to handle are the log files that spew out of nearly every tool that does interesting things on your system. Some of my past columns have focused on performing analysis of this data, but let's look at a more mundane problem: simple cleanup.
Let's say a tool is generating an ever-increasing log file, appending new messages to the end. Most of the file is interesting, but there are a number of lines beginning with warning: that are entirely uninteresting. So, our job is to remove those lines from the resulting file.
Let's start with the easiest approach. Assuming the log file has already been generated as log, let's filter it into clean-log, using a simple read-and-conditionally-print loop:
  open IN, "log" or die "Cannot open: $!";
  open OUT, ">clean-log" or die "Cannot create: $!";
  while (<IN>) {
    print OUT $_ unless /^warning:/i;
  }
Here, we have two nice normal opens, and then a loop. Each time through the while loop, a new line ends up in $_. This is tested with the regular expression, and if the match fails, the line gets printed to the output file.
Well, this works pretty nicely, but now we have used up nearly twice the disk space. Let's solve that by adding a renaming operation at the end. We can also avoid coming up with a new filename by using the convention that an appended tilde (~) character means ``a temporary file or a backup file''.
  my $name = "log";
  open IN, "<$name" or die "Cannot open: $!";
  open OUT, ">$name~" or die "Cannot create: $!";
  while (<IN>) {
    print OUT $_ unless /^warning:/i;
  }
  close IN;
  close OUT;
  rename "$name~", $name or die "Cannot rename: $!";
Here, I've parameterized the name in $name, and the output file is now named that name with an appended tilde. Notice the last few steps: we're now renaming the temp file over the original file, thus deleting the original file. This is getting better. I now have a script I can run that appears merely to make the file shorter, and contain only what I want!
Hmm. What if this script makes a mistake? It'd be nice to have a backup of the original data that I could look at for a while just in case. I could run a diff on the old and new files to see what changed, for example. Let's do the steps in the other order: rename the file first, then generate the selected lines into a new file with the original name.
  my $name = "log";
  rename $name, "$name~" or die "Cannot rename: $!";
  open IN, "<$name~" or die "Cannot open: $!";
  open OUT, ">$name" or die "Cannot create: $!";
  while (<IN>) {
    print OUT $_ unless /^warning:/i;
  }
Hmm. That looks nicer. Now I have a backup file (named with tilde) and the new data file. Let's make this even easier to use; there's no reason to hardwire the filename into the script. Let's get that from the command line (@ARGV), and let there be many files on the command line:
  foreach my $name (@ARGV) {
    rename $name, "$name~" or die "Cannot rename: $!";
    open IN, "<$name~" or die "Cannot open: $!";
    open OUT, ">$name" or die "Cannot create: $!";
    while (<IN>) {
      print OUT $_ unless /^warning:/i;
    }
  }
Wow, getting even more powerful, but with more lines of code. But perhaps I'm actually leading you astray a bit. Larry Wall (the creator of Perl) probably had to do this kind of text editing enough times that he taught Perl to do it directly. The in-place editing mode handles this stuff for us. If $^I is set (either from within the program, or by the -i command-line switch), opening files with the diamond operator (<>) automatically performs a similar operation:
  $^I = "~";
  while (<>) {
    print unless /^warning:/i;
  }
Ahh. Much easier. And roughly the same operations as above. Notice I didn't need the loop for @ARGV; that's implicit in the diamond operator. The value set for $^I (normally undef) is added to the names of the files in @ARGV to create backup files.
All this renaming and editing presumes that the program is no longer writing to the logfile, and that we can do as we please with the original data. Let's throw a monkey wrench into that picture, to illustrate handling a less-common but equally important environment: file locking.
The UNIX filesystem allows multiple processes to gain write access to a file. But unless there's some way of coordinating the writes to a file, the data will become all intermingled.
The most common way to prevent this intermingling is with a file lock. In Perl, this is most easily accessed with the flock operator, named for the system call introduced by the UNIX BSD developers. Even though the same-named system call doesn't exist on System V variants of UNIX, the Perl operator maps into the appropriate underlying operations to perform a compatible operation, so it's fairly portable.
The basic rules are as follows:

- Programs that want to read a file should open the file, then immediately use flock HANDLE, 1.

- Programs that want to both read and write a file should open the file, then use flock HANDLE, 2.

- The call to flock will block until the file is available, at which time the requested operations can be performed with some degree of safety.

- When the operation has been completed, release the lock by closing the filehandle. (You can also unlock the filehandle without closing it, but you must know precisely what you are doing. It's easier just to always close the handle.)
Note that locking a file only cooperates with other processes that are also locking that file. If a process so chooses, it can come along, open the file for reading or writing, and have its way with the file. That's why it's called advisory locking. (Some UNIX variants have implemented mandatory locking, but that's not common yet.)
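To make the writer's side concrete, here's a sketch of what a cooperative logging tool might do around each append. The filename and message are invented for illustration; the flock calls are the point:

```perl
# Sketch of a well-behaved log writer: it takes an exclusive lock
# (mode 2) around each append, so readers and cleanup scripts that
# also flock the file never see a half-written line.
open LOG, ">>example.log" or die "Cannot append: $!";
flock LOG, 2;      # blocks until the exclusive lock is granted
seek LOG, 0, 2;    # re-seek to end: others may have appended while we waited
print LOG "warning: disk is getting full\n";
close LOG;         # closing the handle releases the lock
```

The seek after flock matters: another locker may have extended the file between our open and our acquiring the lock.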
So, let's pretend our tool that's creating the log file is still writing to it, and that it's a nice tool that flocks the file whenever it is really writing. How can we remove those warnings now, without copying the data somewhere else?
The first and fastest way is to pull the data entirely into memory, rewriting it without those pesky warnings:
  my $name = "log";
  open LOG, "+<$name" or die "Cannot open $name: $!";
  flock LOG, 2;
  my @data = grep !/^warning:/i, <LOG>;
  seek LOG, 0, 0;
  truncate LOG, 0;
  print LOG @data;
  close LOG;
Here, I've added a plus to the open mode to indicate that the same handle will be used with both reading and writing. (I can't just open the filehandle for writing later, because it would lose the lock that I'm holding on the filehandle.)
After the lock is obtained, we're free to mangle the file in useful ways. We load up @data with the lines from the file that are interesting, then rewrite the file by seeking to its beginning, truncating the file to zero length, and then dumping the data down to the file. The final close frees up the file so that the other tool can now obtain a new lock to write into the file again.
Well, that was pretty nifty, but notice that I now needed to have the entire (new) data in memory. For a really huge file, this is bound to be a problem. So, let's edit the file ``in place''. It's a bit tricky, so I'll give you the program then explain how it works:
  my $name = "log";
  open LOG, "+<$name" or die "Cannot open $name: $!";
  flock LOG, 2;
  my $write_pos = 0;
  while (<LOG>) {
    unless (/^warning:/i) {
      my $read_pos = tell LOG;
      seek LOG, $write_pos, 0;
      print LOG $_;
      $write_pos = tell LOG;
      seek LOG, $read_pos, 0;
    }
  }
  truncate LOG, $write_pos;
  close LOG;
Here, I'm maintaining two pointers into the file: a read position, and a write position. The read position is being maintained automatically, via the while loop. The write position initially starts out as zero, and is remembered in the $write_pos variable.

Inside the loop, when I see an entry I want to keep, I compute the current read position via tell, go to the writing position, write the value I want to remember, and then return to the reading position. Once I've gone through the entire file, I can simply truncate it to the write position, and I'm done.
This works only because I'm making the file shorter, but it will work on files of huge length, since the most I've actually got in memory at any moment is one line.
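If you'd like to convince yourself that the two-pointer dance really works, here's a self-contained rerun of it against a scratch file (the demo-log name and sample lines are made up for the demonstration):

```perl
# Write some sample data, mixing keepers and warnings.
my $name = "demo-log";
open LOG, ">$name" or die "Cannot create: $!";
print LOG "one\nwarning: drop me\ntwo\nWARNING: me too\nthree\n";
close LOG;

# Now the two-pointer in-place rewrite, exactly as described above.
open LOG, "+<$name" or die "Cannot open $name: $!";
flock LOG, 2;
my $write_pos = 0;
while (<LOG>) {
  unless (/^warning:/i) {
    my $read_pos = tell LOG;    # remember where the next read starts
    seek LOG, $write_pos, 0;    # hop back to the write position
    print LOG $_;               # write the line we're keeping
    $write_pos = tell LOG;      # remember the new write position
    seek LOG, $read_pos, 0;     # hop forward to resume reading
  }
}
truncate LOG, $write_pos;       # chop off the leftover tail
close LOG;
```

Afterward, demo-log contains just the three keeper lines, and nothing was ever held in memory but the current line.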
So, there you have it. Many ways to reduce the amount of data you'll be wading through later. Enjoy.