Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Unix Review Column 15 (July 1997)

The acronym ``PERL'' was originally coined to mean ``Practical Extraction and Report Language''. Although Perl has expanded wildly in capability and application areas over the nearly ten years of its life, constructing a report from extracted data is still at the heart of many Perl programming problems.

Let's take a look at a typical problem: analyzing a log file. Log files get created all over the place these days. For example, login/logout records, command invocations, file transfers, mail daemons, gopher servers, and yes, the ever-busier web servers all generate lines and lines and pages of seemingly unending streams of data. Each individual transaction is probably not worth examining (unless you've had a security violation recently), but reducing this data to meaningful summary reports is an increasingly common task.

Let's look at a hypothetical ``file transfer log'' that looks like so:

    fred wilma 08:50 730
    barney betty 06:15 190
    betty barney 22:27 993
    barney wilma 23:47 504
    fred wilma 04:29 836
    betty betty 14:37 738
    wilma barney 18:47 825

and consists of four space-separated columns containing the source host, destination host, 24-hour clock time, and number of bytes transferred.

I generated some sample data with a little test program:

    my @hosts = qw(fred barney betty wilma);
    srand;
    sub randhost { $hosts[rand @hosts]; }
    for (my $n = 0; $n <= 999; $n++) {
      printf
        "%s %s %02d:%02d %d\n",
        randhost, randhost, rand(24), rand(60), rand(1000);
    }

which will spit out the proper fields. Adjust the ``999'' to change the size of the output. (This program uses the Perl 5.004 syntax that allows my() in the for-loop initializer... if you get errors on older releases of Perl 5, remove the my() keywords.)

So, now we have some relatively boring data. Let's see how to generate some typical reports. All reports will require that the data be parsed into columns, and that part is common to every program here, so let's get it out of the way first.

A skeleton parsing program looks like this:

    while (<>) {
      my ($from, $to, $hh, $mm, $bytes) =
        /^(\S+) (\S+) (\d+):(\d+) (\d+)$/
          or (warn "bad format on line $.: $_"), next;
      # accumulate
    }
    # print the result here

If there were multiple formats in the file, the failed regular expression match could go on to try other patterns. This is useful when there are variations in the format of the fields.
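For instance, a fallback chain might look like this (a sketch of mine, inventing a second format with an extra leading date field purely for illustration):

    while (<>) {
      my ($from, $to, $hh, $mm, $bytes);
      if (/^(\S+) (\S+) (\d+):(\d+) (\d+)$/) {
        # the format described above
        ($from, $to, $hh, $mm, $bytes) = ($1, $2, $3, $4, $5);
      } elsif (/^\S+ (\S+) (\S+) (\d+):(\d+) (\d+)$/) {
        # hypothetical variant with one extra leading field (a date, say)
        ($from, $to, $hh, $mm, $bytes) = ($1, $2, $3, $4, $5);
      } else {
        warn "bad format on line $.: $_";
        next;
      }
      # accumulate
    }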

So far, the skeleton parses the data, but doesn't accumulate the numbers or print the results. Let's start with something simple: count the total bytes transferred and the number of transfer jobs:

      # for accumulate:
      $total_bytes += $bytes;
      $jobs++;

which goes inside the loop, and then to print it out, outside the loop:

    # for print:
    print "total bytes = $total_bytes, jobs = $jobs\n";

Well, that wasn't tough. Let's try something a little more interesting. Let's see how many bytes came from each host. We'll do it with a hash, in which the key is the name of the host, and the corresponding value is the total count (and keep track of the job count in a separate hash):

      # for accumulate:
      $from_bytes{$from} += $bytes;
      $from_jobs{$from}++;

Now to print this, we have to walk the hash to dump it:

    # for print:
    for my $from (sort keys %from_bytes) {
      my $bytes = $from_bytes{$from};
      my $jobs = $from_jobs{$from};
      print "$from sent $bytes bytes on $jobs jobs\n";
    }

Here, the keys of %from_bytes are examined, sorted, and then used one at a time to pull out the corresponding values from the two hashes created in parallel. Hey, now we're getting somewhere.
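And since we kept the job counts in a parallel hash, a derived statistic is only a division away. For example, a quick sketch to report the average transfer size per host might look like:

    # a variation: report the average number of bytes per job
    for my $from (sort keys %from_bytes) {
      my $avg = $from_bytes{$from} / $from_jobs{$from};
      printf "%s averaged %.1f bytes per job\n", $from, $avg;
    }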

What if we wanted the total bytes transferred for each host, and didn't care whether the bytes were coming in or going out? To do this, I use a slick trick: adding the number into two different places in the hash:

      # for accumulate:
      $total_bytes{$from} += $bytes;
      $total_bytes{$to} += $bytes;

and then we'd walk the %total_bytes hash similar to the code above:

    # for print:
    for my $host (sort keys %total_bytes) {
      my $bytes = $total_bytes{$host};
      print "$host did $bytes\n";
    }

Note that if we were computing a ``grand total'' of bytes from this table, the value would be double the real traffic, because each transfer is counted once for its source and once for its destination, so be careful when you are doing this. If you wanted to be able to ``grand total'' this number, one step might be to allocate half of each byte-count to each host, as in:

      # for accumulate:
      $bytes /= 2; # allocation correction
      $total_bytes{$from} += $bytes;
      $total_bytes{$to} += $bytes;

There. Now a grand total of this table will show the same as the grand total of the other table. ``How to lie with statistics'', I guess.
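To verify that, a grand total is just a sum over the hash values; here's a quick sketch:

    # sum the per-host totals; with the halving correction in place,
    # this matches the actual number of bytes transferred
    my $grand_total = 0;
    $grand_total += $_ for values %total_bytes;
    print "grand total = $grand_total bytes\n";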

So far, we have accumulations based on zero data items (the grand total) and based on one data item (like the source host). Can we do accumulations based on two or more data items? Certainly, thanks to Perl's ability to have (apparently) nested hashes. Let's look at a two-way table, summarizing all transfers by both the source host and the destination host. That'd look like this:

      # for accumulate:
      $from_to_bytes{$from}{$to} += $bytes;

Yeah. That's it! We're now tracking the source and destination hosts. Dumping this data is a little trickier though. Here's a line-by-line dump for all the combinations of ``from'' and ``to'':

    # for print:
    for my $from (sort keys %from_to_bytes) {
      my $second = $from_to_bytes{$from};
      for my $to (sort keys %$second) {
        my $bytes = $second->{$to};
        print "$from to $to did $bytes\n";
      }
    }

Yeah, a little more complicated, because we have to go through the entire matrix. The outer loop walks through all source hosts, grabbing the inner nested hash reference in $second. This inner hash is then walked via the inner loop, yielding the nested data.
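As a variation, you could flatten the matrix into from/to/bytes triples and sort by byte count, putting the busiest pairs first; a quick sketch:

    my @triples;
    for my $from (keys %from_to_bytes) {
      my $second = $from_to_bytes{$from};
      for my $to (keys %$second) {
        push @triples, [ $from, $to, $second->{$to} ];
      }
    }
    # sort descending on the byte count (third element)
    for (sort { $b->[2] <=> $a->[2] } @triples) {
      my ($from, $to, $bytes) = @$_;
      print "$from to $to did $bytes\n";
    }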

The output of the line-by-line dump is a little ugly, like:

    [...]
    barney to fred did 2792
    barney to wilma did 4683
    betty to barney did 2333
    betty to betty did 2568
    [...]

so let's clean it up a bit, creating an actual square table. First, we have to compute all possible destination hosts for the column headings:

    my %to_hosts = ();
    for my $from (sort keys %from_to_bytes) {
      my $second = $from_to_bytes{$from};
      my @keys = keys %$second;
      @to_hosts{@keys} = ();
    }
    my @to_hosts = sort keys %to_hosts;

Here, I create a temporary hash called %to_hosts which serves as a ``set''. For each of the source hosts, I pull out all the destination hosts into @keys (inside the loop), and then add those members into the ``set''. The final statement extracts the members of that set into an ordinary array @to_hosts, which I'll then use for the column headers and keys. The column headers are printed with:

    printf "%10s:", "bytes to";
    for (@to_hosts) {
      printf " %10s", $_;
    }
    print "\n";

and then the walk through the matrix is actually a bit simpler than the previous example:

    for my $from (sort keys %from_to_bytes) {
      printf "%10s:", $from;
      for my $to (@to_hosts) {
        my $bytes = $from_to_bytes{$from}{$to} || "- none -";
        printf " %10s", $bytes;
      }
      print "\n";
    }

because we don't need to fetch the keys of the second hash... they're already in @to_hosts. The output of this program looks like:

      bytes to:     barney      betty       fred      wilma
        barney:       3303       4429       2792       4683
         betty:       2333       2568       3928       1813
          fred:       2416       3542        226       5293
         wilma:       5267       1196       2706       4580
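
To wrap up, here are all the two-way table pieces assembled into one complete program (just the snippets above, concatenated):

    my %from_to_bytes;
    while (<>) {
      my ($from, $to, $hh, $mm, $bytes) =
        /^(\S+) (\S+) (\d+):(\d+) (\d+)$/
          or (warn "bad format on line $.: $_"), next;
      $from_to_bytes{$from}{$to} += $bytes;
    }
    # compute all destination hosts for the column headings
    my %to_hosts = ();
    for my $from (keys %from_to_bytes) {
      @to_hosts{keys %{ $from_to_bytes{$from} }} = ();
    }
    my @to_hosts = sort keys %to_hosts;
    # print the column headings
    printf "%10s:", "bytes to";
    printf " %10s", $_ for @to_hosts;
    print "\n";
    # print one row per source host
    for my $from (sort keys %from_to_bytes) {
      printf "%10s:", $from;
      for my $to (@to_hosts) {
        printf " %10s", $from_to_bytes{$from}{$to} || "- none -";
      }
      print "\n";
    }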

There you have it... some sample data reduction program snippets. Have fun reducing data!


Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc. of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.