Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is the one the author originally submitted for publication, before the editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 15 (July 1997)
The acronym ``PERL'' was originally coined to mean ``Practical Extraction and Report Language''. Although Perl's application areas and capabilities have grown wildly in the nearly ten years of its life, the basic task of constructing a report from extracted data is still at the heart of many Perl programming problems.
Let's take a look at a typical problem: analyzing a log file. Log files get created all over the place these days. For example, login/logout records, command invocations, file transfers, mail daemons, gopher servers, and yes, the ever-increasing use of web servers, all generate lines and lines and pages of seemingly unending streams of data. Each individual transaction is probably not worth examining (unless you've had a security violation recently), but the reduction of this data to somewhat meaningful summary reports is an increasingly common task.
Let's look at a hypothetical ``file transfer log'' that looks like so:
    fred wilma 08:50 730
    barney betty 06:15 190
    betty barney 22:27 993
    barney wilma 23:47 504
    fred wilma 04:29 836
    betty betty 14:37 738
    wilma barney 18:47 825
and consists of four space-separated columns containing the source host, destination host, 24-hour clock time, and number of bytes transferred.
I generated some sample data with a little test program:
    my @hosts = qw(fred barney betty wilma);
    srand;
    sub randhost {
        $hosts[rand @hosts];
    }
    for (my $n = 0; $n <= 999; $n++) {
        printf "%s %s %02d:%02d %d\n",
            randhost, randhost, rand(24), rand(60), rand(1000);
    }
which will spit out the proper fields. Adjust the ``999'' to adjust the size of the output. (This program uses the 5.004 syntax with its very flexible placement of my()... if you get errors on older versions of Perl 5, remove the my() keywords.)
So, now we have some relatively boring data. Let's see how to generate some typical reports. All of the reports require that the data be parsed into columns, and that parsing code will be common to each of the programs, so let's get it out of the way first.
A skeleton parsing program looks like this:
    while (<>) {
        my ($from, $to, $hh, $mm, $bytes) =
            /^(\S+) (\S+) (\d+):(\d+) (\d+)$/
            or (warn "bad format on line $.: $_"), next;

        # accumulate
    }
    # print the result here
If there were multiple formats in the file, the failed regular expression match could go on to try other combinations. This is useful when there are variations of field data.
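To make that fall-through concrete, here's a minimal sketch of trying a second pattern when the first fails. The ``RESEND host bytes'' format and the two sample lines are invented for illustration; they aren't part of the column's log format.

```perl
use strict;
use warnings;

# Hypothetical second format: "RESEND host bytes" (invented for
# illustration). Each line is tried against both patterns in turn.
my @lines = (
    "fred wilma 08:50 730\n",
    "RESEND barney 190\n",
);
my (@transfers, @resends);
for (@lines) {
    if (my ($from, $to, $hh, $mm, $bytes) =
        /^(\S+) (\S+) (\d+):(\d+) (\d+)$/) {
        push @transfers, [ $from, $to, $bytes ];
    } elsif (my ($host, $rbytes) = /^RESEND (\S+) (\d+)$/) {
        push @resends, [ $host, $rbytes ];
    } else {
        warn "bad format: $_";
    }
}
```

A list assignment in boolean context yields the number of elements matched, so a failed match (empty list) is false and control falls through to the next pattern.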
So far, the program parses the data, but doesn't accumulate the numbers or print the results. Let's start with something simple: count the total bytes transferred and the number of transfer jobs:
    # for accumulate:
    $total_bytes += $bytes;
    $jobs++;
which goes inside the loop, and then to print it out, outside the loop:
    # for print:
    print "total bytes = $total_bytes, jobs = $jobs\n";
Well, that wasn't tough. Let's try something a little more interesting. Let's see how many bytes came from each host. We'll do it with a hash, in which the key is the name of the host, and the corresponding value is the total count (and keep track of the job count in a separate hash):
    # for accumulate:
    $from_bytes{$from} += $bytes;
    $from_jobs{$from}++;
Now, to print this, we walk the hash to dump it:
    # for print:
    for my $from (sort keys %from_bytes) {
        my $bytes = $from_bytes{$from};
        my $jobs = $from_jobs{$from};
        print "$from sent $bytes bytes on $jobs jobs\n";
    }
Here, the keys of %from_bytes are examined, sorted, and then used one at a time to pull out the corresponding values from the two hashes created in parallel. Hey, now we're getting somewhere.
What if we wanted total bytes transferred, without caring whether they were inbound or outbound? To do this, I use a slick trick of adding the number into two different places in the hash:
    # for accumulate:
    $total_bytes{$from} += $bytes;
    $total_bytes{$to} += $bytes;
and then we'd walk the %total_bytes hash much as in the code above:
    # for print:
    for my $host (sort keys %total_bytes) {
        my $bytes = $total_bytes{$host};
        print "$host did $bytes\n";
    }
Note that if we were computing a ``grand total'' of bytes, the value would be doubled here (each transfer is counted once for its source and once for its destination), so be careful when you are doing this. If you wanted a number that could be ``grand totaled'', one step might be to allocate half of each byte count to each host, as in:
    # for accumulate:
    $bytes /= 2; # allocation correction
    $total_bytes{$from} += $bytes;
    $total_bytes{$to} += $bytes;
There. Now a grand total of this table will show the same as the grand total of the other table. ``How to lie with statistics'', I guess.
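Here's a quick self-contained sketch verifying that claim: with the halving correction, summing %total_bytes recovers the true grand total. The two sample records are made up for the demonstration.

```perl
use strict;
use warnings;

# Two made-up transfer records: [from, to, bytes].
my @records = (
    [ 'fred',  'wilma',  730 ],
    [ 'betty', 'barney', 190 ],
);

my (%total_bytes, $grand);
for my $r (@records) {
    my ($from, $to, $bytes) = @$r;
    $grand += $bytes;           # the true grand total
    $bytes /= 2;                # allocation correction
    $total_bytes{$from} += $bytes;
    $total_bytes{$to}   += $bytes;
}

my $table_total = 0;
$table_total += $_ for values %total_bytes;
print "grand=$grand table=$table_total\n";
```

Without the `$bytes /= 2` line, $table_total would come out at twice $grand.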
So far, we have accumulations based on zero data items (the grand total) and on one data item (like the source host). Can we do accumulations based on two or more data items? Certainly, thanks to Perl's ability to have (apparently) nested hashes. Let's look at a two-way table, summarizing all transfers by both the source host and the destination host. That'd look like this:
    # for accumulate:
    $from_to_bytes{$from}{$to} += $bytes;
Yeah. That's it! We're now tracking the source and destination hosts. Dumping this data is a little trickier though. Here's a line-by-line dump for all the combinations of ``from'' and ``to'':
    # for print:
    for my $from (sort keys %from_to_bytes) {
        my $second = $from_to_bytes{$from};
        for my $to (sort keys %$second) {
            my $bytes = $second->{$to};
            print "$from to $to did $bytes\n";
        }
    }
Yeah, a little more complicated, because we have to go through the entire matrix. The outer loop walks through all source hosts, grabbing the inner nested hash reference in $second. This inner hash is then walked via the inner loop, yielding the nested data.
The output of this program is a little ugly, like:
    [...]
    barney to fred did 2792
    barney to wilma did 4683
    betty to barney did 2333
    betty to betty did 2568
    [...]
so let's clean it up a bit, creating an actual square table. First, we have to compute all possible destination hosts for the column headings:
    my %to_hosts = ();
    for my $from (sort keys %from_to_bytes) {
        my $second = $from_to_bytes{$from};
        my @keys = keys %$second;
        @to_hosts{@keys} = ();
    }
    my @to_hosts = sort keys %to_hosts;
Here, I create a temporary hash called %to_hosts which serves as a ``set''. For each of the source hosts, I pull out all the destination hosts into @keys (inside the loop), and then add those members into the ``set''. The final statement extracts the members of that set into an ordinary array @to_hosts, which I'll then use for the column headers and keys. The column headers are printed with:
    printf "%10s:", "bytes to";
    for (@to_hosts) {
        printf " %10s", $_;
    }
    print "\n";
and then the walk through the matrix actually is a bit simpler than the previous example:
    for my $from (sort keys %from_to_bytes) {
        printf "%10s:", $from;
        for my $to (@to_hosts) {
            my $bytes = $from_to_bytes{$from}{$to} || "- none -";
            printf " %10s", $bytes;
        }
        print "\n";
    }
because we don't need to get the keys of the second hash... the destination hosts are already in @to_hosts. The output of this program looks like:
      bytes to:     barney      betty       fred      wilma
        barney:       3303       4429       2792       4683
         betty:       2333       2568       3928       1813
          fred:       2416       3542        226       5293
         wilma:       5267       1196       2706       4580
There you have it... some sample data reduction program snippets. Have fun reducing data!
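If you'd like to see the snippets stitched together, here is one way they might be assembled into a single runnable program. This assembly (and the embedded three-line sample log) is mine, not from the column; a real run would read the log from standard input or a file via the `while (<>)` skeleton above.

```perl
use strict;
use warnings;

# Made-up sample log lines, standing in for a real log file.
my @log = (
    "fred wilma 08:50 730\n",
    "fred wilma 04:29 836\n",
    "betty barney 22:27 993\n",
);

# parse and accumulate the two-way table
my %from_to_bytes;
for (@log) {
    my ($from, $to, $hh, $mm, $bytes) =
        /^(\S+) (\S+) (\d+):(\d+) (\d+)$/
        or (warn "bad format: $_"), next;
    $from_to_bytes{$from}{$to} += $bytes;
}

# collect every destination host for the column headings
my %to_hosts;
for my $from (keys %from_to_bytes) {
    @to_hosts{ keys %{ $from_to_bytes{$from} } } = ();
}
my @to_hosts = sort keys %to_hosts;

# print the square table
printf "%10s:", "bytes to";
printf " %10s", $_ for @to_hosts;
print "\n";
for my $from (sort keys %from_to_bytes) {
    printf "%10s:", $from;
    for my $to (@to_hosts) {
        printf " %10s", $from_to_bytes{$from}{$to} || "- none -";
    }
    print "\n";
}
```

The two fred-to-wilma transfers collapse into a single 1566-byte cell, and hosts with no traffic in a given direction show ``- none -''.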